HIGH-DIMENSIONAL GENERALIZED LINEAR MODELS
AND THE LASSO

By Sara A. van de Geer 
ETH Zurich 

We consider high-dimensional generalized linear models with Lip- 
schitz loss functions, and prove a nonasymptotic oracle inequality for 
the empirical risk minimizer with Lasso penalty. The penalty is based 
on the coefficients in the linear predictor, after normalization with the 
empirical norm. The examples include logistic regression, density es- 
timation and classification with hinge loss. Least squares regression 
is also discussed. 

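1. Introduction. We consider the lasso penalty for high-dimensional
generalized linear models. Let $Y \in \mathcal{Y} \subset \mathbb{R}$ be a real-valued (response)
variable and $X$ be a co-variable with values in some space $\mathcal{X}$. Let

$$\mathcal{F} = \Big\{ f_\theta(\cdot) = \sum_{k=1}^m \theta_k \psi_k(\cdot),\ \theta \in \Theta \Big\}$$

be a (subset of a) linear space of functions on $\mathcal{X}$. We let $\Theta$ be a convex
subset of $\mathbb{R}^m$, possibly $\Theta = \mathbb{R}^m$. The functions $\{\psi_k\}_{k=1}^m$ form a given
system of real-valued base functions on $\mathcal{X}$.

Let $\gamma_f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be some loss function, and let $\{(X_i, Y_i)\}_{i=1}^n$ be
i.i.d. copies of $(X, Y)$. We consider the estimator with lasso penalty

$$\hat\theta_n := \arg\min_{\theta \in \Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \gamma_{f_\theta}(X_i, Y_i) + \lambda_n \hat I(\theta) \Big\},$$

where

$$\hat I(\theta) := \sum_{k=1}^m \hat\sigma_k |\theta_k|$$

denotes the weighted $\ell_1$ norm of the vector $\theta \in \mathbb{R}^m$, with random weights

$$\hat\sigma_k := \Big( \frac{1}{n} \sum_{i=1}^n \psi_k^2(X_i) \Big)^{1/2}.$$

Moreover, the smoothing parameter $\lambda_n$ controls the amount of complexity
regularization.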
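
As a small illustration of the penalty, the sketch below computes the empirical weights (root mean squares of each base function over the sample) and the resulting weighted $\ell_1$ norm. The function names and the toy design matrix are hypothetical and chosen only for this example.

```python
import math

def empirical_weights(psi_values):
    """psi_values[i][k] holds psi_k(X_i); returns sigma_hat_k for each k."""
    n = len(psi_values)
    m = len(psi_values[0])
    return [math.sqrt(sum(row[k] ** 2 for row in psi_values) / n)
            for k in range(m)]

def lasso_penalty(theta, weights):
    """Weighted l1 norm: sum over k of weights[k] * |theta[k]|."""
    return sum(w * abs(t) for w, t in zip(weights, theta))

# Hypothetical toy design: n = 4 observations, m = 2 base functions.
design = [[1.0, 2.0], [1.0, -2.0], [-1.0, 0.0], [-1.0, 0.0]]
sigma_hat = empirical_weights(design)
penalty = lasso_penalty([0.5, -1.0], sigma_hat)
```

Because each coefficient is multiplied by the empirical norm of its base function, rescaling a base function and its coefficient in opposite directions leaves the penalty unchanged.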

Indeed, when $m$ is large (possibly $m > n$), a complexity penalty is needed
to avoid overfitting. One could penalize the number of nonzero coefficients,
that is, use an $\ell_0$ penalty, but this leads to a nonconvex optimization
problem. The lasso is a convex $\ell_1$ penalty which often behaves similarly to
the $\ell_0$ penalty. It corresponds to soft thresholding in the case of quadratic
loss and orthogonal design, see Donoho (1995). In Donoho (2006a), (2006b),
the agreement of $\ell_1$- and $\ell_0$-solutions in general (nonorthogonal) systems
is developed further. We will address the distributional properties of the
lasso penalty. The acronym "lasso" (least absolute shrinkage and selection
operator) was introduced by Tibshirani (1996), with further work in
Hastie, Tibshirani and Friedman (2001).
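
To make the link with soft thresholding concrete, here is a minimal sketch for a single coordinate. It assumes the scalar objective $\frac12 (z - b)^2 + \lambda |b|$, to which the lasso problem reduces coordinate by coordinate under quadratic loss and orthogonal design. The names `soft_threshold` and `objective` are illustrative only.

```python
def objective(b, z, lam):
    """One-coordinate lasso objective 0.5*(z - b)^2 + lam*|b|."""
    return 0.5 * (z - b) ** 2 + lam * abs(b)

def soft_threshold(z, lam):
    """Closed-form minimizer of objective(., z, lam): shrink z toward 0 by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Coordinates with $|z| \le \lambda$ are set exactly to zero, which is how the convex $\ell_1$ penalty produces sparse solutions.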

Let $P$ be the distribution of $(X, Y)$. The target function $\bar f$ is defined as

$$\bar f := \arg\min_{f \in \mathbf{F}} P\gamma_f,$$

where $\mathbf{F} \supseteq \mathcal{F}$ (and assuming for simplicity that there is a unique
minimum). We will show that if the target $\bar f$ can be well approximated by a
sparse function $f_{\theta_n^*}$, that is, a function $f_{\theta_n^*}$ with only a few nonzero
coefficients $\theta_{n,k}^*$, the estimator $\hat\theta_n$ will have prediction error roughly
as if it knew this sparseness. In this sense, the estimator mimics a
sparseness oracle. Our results are of the same spirit as those in
Bunea, Tsybakov and Wegkamp (2007b), which we learned about when writing
this paper, and which is about quadratic loss. We will assume Lipschitz loss
(see Assumption L below). Examples are the loss functions used in quantile
regression, logistic regression, the hinge loss function used in
classification, and so forth. Quadratic loss can, however, be handled using
similar arguments, see Example 4.

Let us briefly sketch our main result. The excess risk of $f$ is

$$\mathcal{E}(f) := P\gamma_f - P\gamma_{\bar f}.$$

We will derive a probability inequality for the excess risk $\mathcal{E}(f_{\hat\theta_n})$ (see
Theorems 2.1 and 2.2). This inequality implies that with large probability,

$$\mathcal{E}(f_{\hat\theta_n}) \le \mathrm{const.} \times \min_{\theta} \{ \mathcal{E}(f_\theta) + \mathcal{V}_\theta \}.$$

Here, the error term $\mathcal{V}_\theta$ will be referred to as "estimation error." It is
typically proportional to $\lambda_n^2 \times \dim_\theta$, where $\dim_\theta := |\{\theta_k \ne 0\}|$.
Moreover, as we will see, the smoothing parameter $\lambda_n$ can be chosen of order
$\sqrt{\log m / n}$. Thus, again typically, the term $\mathcal{V}_\theta$ is of order
$\log m \times \dim_\theta / n$, which is the usual form for the estimation error when
estimating $\dim_\theta$ parameters, with the $(\log m)$-term the price to pay for not
knowing beforehand which parameters are relevant.

We will study a more general situation, with general margin behavior, that
is, the behavior of the excess risk near the target (see Assumption B). The
term $\mathcal{V}_\theta$ will depend on this margin behavior as well as on the dependence
structure of the features $\{\psi_k\}_{k=1}^m$ (see Assumption C).

The typical situation sketched above corresponds to "quadratic" margin
behavior and, for example, a well conditioned inner product matrix of the
features.

To avoid digressions, we will not discuss in detail the case where the
inner product matrix is allowed to be singular, but only briefly sketch an
approach to handle this (see Section 3.1).

There are quite a few papers on estimation with the lasso in
high-dimensional situations. Bunea, Tsybakov and Wegkamp (2006) and
Bunea, Tsybakov and Wegkamp (2007b) are for the case of quadratic loss.
In Tarigan and van de Geer (2006), the case of hinge loss function is
considered, and adaptation to the margin in classification. Greenshtein (2006)
studies general loss with $\ell_1$ constraint. The group lasso is studied and
applied in Meier, van de Geer and Bühlmann (2008) for logistic loss and
Dahinden et al. (2008) for log-linear models. Bunea, Tsybakov and Wegkamp
(2007a) studies the lasso for density estimation.

We will mainly focus on the prediction error of the estimator, that is,
its excess risk, and not so much on selection of variables. Recent work
where the lasso is invoked for variable selection is Meinshausen (2007),
Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Zhang and Huang
(2006) and Meinshausen and Yu (2007). As we will see, we will obtain
inequalities for the $\ell_1$ distance between $\hat\theta_n$ and the "oracle" $\theta_n^*$ defined
below [see (6) and (9)]. These inequalities can be invoked to prove variable
selection, possibly after truncating the estimated coefficients.

We extend results from the papers [Loubes and van de Geer (2002),
van de Geer (2003)] in several directions. First, the case of random design
is considered. The number of variables $m$ is allowed to be (much) larger
than the number of observations $n$. The results are explicit, with constants
with nonasymptotic relevance. It is not assumed a priori that the regression
functions $f_\theta \in \mathcal{F}$ are uniformly bounded in sup norm. Convergence of the
estimated coefficients $\hat\theta_n$ in $\ell_1$ norm is obtained. And finally, the penalty
is based on the weighted $\ell_1$ norm of the coefficients $\theta$, where it is allowed
that the base functions $\psi_k$ are normalized with their empirical norm $\hat\sigma_k$.

The paper is organized as follows. In the remainder of this section, we
introduce some notation and assumptions that will be used throughout
the paper. In Section 2, we first consider the case where the weights
$\sigma_k := (E\psi_k^2(X))^{1/2}$ are known. Then the results are simpler to formulate. It
serves as a preparation for the situation of unknown $\sigma_k$. We have chosen to
present the main results (Theorems 2.1 and 2.2) with rather arbitrary values
for the constants involved, so that their presentation is more transparent.
The explicit dependence on the constants can be found in Theorem A.4 (for
the case of known $\sigma_k$) and Theorem A.5 (for the case of unknown $\sigma_k$).

Section 3 discusses the Assumptions L, A, B and C given below, and 
presents some examples. Some extensions are considered, for instance hinge 
loss, or support vector machine loss, (which may need an adjustment of 
Assumption B), and quadratic loss (which is not Lipschitz). Moreover, we 
discuss the case where some coefficients (for instance, the constant term) 
are not penalized. 

The proof of the main theorems is based on the concentration inequal- 
ity of Bousquet (2002). We moreover use a convexity argument to obtain 
reasonable constants. All proofs are deferred to the Appendix. 

We use the following notation. The empirical distribution based on the
sample $\{(X_i, Y_i)\}_{i=1}^n$ is denoted by $P_n$, and the empirical distribution of the
covariates $\{X_i\}_{i=1}^n$ is written as $Q_n$. The distribution of $X$ is denoted by $Q$.
We let $\sigma_k^2 := Q\psi_k^2$ and $\hat\sigma_k^2 := Q_n\psi_k^2$, $k = 1, \ldots, m$. The $L_2(Q)$ norm is
written as $\|\cdot\|$. Moreover, $\|\cdot\|_\infty$ denotes the sup norm.

We impose four basic assumptions: Assumptions L, A, B and C. 

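Assumption L. The loss function $\gamma_f$ is of the form $\gamma_f(x, y) = \gamma(f(x), y) + b(f)$,
where $b(f)$ is a constant which is convex in $f$, and $\gamma(\cdot, y)$ is convex for
all $y \in \mathcal{Y}$. Moreover, it satisfies the Lipschitz property

$$|\gamma(f_\theta(x), y) - \gamma(f_{\tilde\theta}(x), y)| \le |f_\theta(x) - f_{\tilde\theta}(x)|$$

for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and all $\theta, \tilde\theta \in \Theta$.

Note that by a rescaling argument, there is no loss of generality in taking
the Lipschitz constant in Assumption L equal to one.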



Assumption A. It holds that

$$K_m := \max_{1 \le k \le m} \frac{\|\psi_k\|_\infty}{\sigma_k} < \infty.$$



Assumption B. There exist an $\eta > 0$ and a strictly convex increasing $G$,
such that for all $\theta \in \Theta$ with $\|f_\theta - \bar f\|_\infty \le \eta$, one has

$$\mathcal{E}(f_\theta) \ge G(\|f_\theta - \bar f\|).$$

Assumption C. There exists a function $D(\cdot)$ on the subsets of the index
set $\{1, \ldots, m\}$, such that for all $\mathcal{K} \subset \{1, \ldots, m\}$, and for all $\theta \in \Theta$ and
$\tilde\theta \in \Theta$, we have

$$\sum_{k \in \mathcal{K}} \sigma_k |\theta_k - \tilde\theta_k| \le \sqrt{D(\mathcal{K})}\, \|f_\theta - f_{\tilde\theta}\|.$$

The convex conjugate of the function $G$ given in Assumption B is denoted
by $H$ [see, e.g., Rockafellar (1970)]. Hence, by definition, for any positive $u$
and $v$, we have

$$uv \le G(u) + H(v).$$

We define moreover for all $\theta \in \Theta$,

$$D_\theta := D(\{k : |\theta_k| \ne 0\}),$$

with $D(\cdot)$ given in Assumption C.
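
The conjugate pair can be checked numerically in the quadratic case. The sketch below assumes $G(u) = u^2/(2C_0)$ with a hypothetical constant $C_0 = 2$; maximizing $uv - G(u)$ over $u$ then gives $H(v) = C_0 v^2/2$, attained at $u = C_0 v$. The grid search is only a brute-force sanity check and plays no role in the theory.

```python
def G(u, c0=2.0):
    """Quadratic margin function G(u) = u^2 / (2*c0); c0 is a hypothetical constant."""
    return u * u / (2.0 * c0)

def H(v, c0=2.0):
    """Closed-form convex conjugate sup_u {u*v - G(u)} = c0 * v^2 / 2."""
    return c0 * v * v / 2.0

def conjugate_on_grid(v, c0=2.0, u_max=50.0, steps=50000):
    """Brute-force approximation of sup over 0 <= u <= u_max of u*v - G(u)."""
    best = 0.0
    for i in range(steps + 1):
        u = u_max * i / steps
        best = max(best, u * v - G(u, c0))
    return best
```

In this case the inequality $uv \le G(u) + H(v)$ follows from $G(u) + H(v) - uv = (u - C_0 v)^2/(2C_0) \ge 0$.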
We let

(1)    $\bar a_n = 4 a_n, \qquad a_n := \sqrt{\frac{2\log(2m)}{n}} + \frac{\log(2m)}{n} K_m,$

with $K_m$ the bound given in Assumption A. We further let, for $t > 0$,

(2)    $\lambda_{n,0} := \lambda_{n,0}(t) := a_n \Big( 1 + t\sqrt{2(1 + 2 a_n K_m)} + \frac{2 t^2 a_n K_m}{3} \Big)$

and

(3)    $\bar\lambda_{n,0} := \bar\lambda_{n,0}(t) := \bar a_n \Big( 1 + t\sqrt{2(1 + 2 \bar a_n K_m)} + \frac{2 t^2 \bar a_n K_m}{3} \Big).$

The quantity $\bar\lambda_{n,0}$ will be a lower bound for the value of the smoothing
parameter $\lambda_n$. Its particular form comes from Bousquet's inequality (see
Theorem A.1). The choice of $t$ is "free." As we will see, large values of $t$ may
give rise to large excess risk of the estimator, but more "confidence" in the
upper bound for the excess risk. In Section 3.2, we will in fact fix the value
of $\bar\lambda_{n,0}(t)$, and the corresponding value of $t$ may be unknown. The latter
occurs when $K_m$ is unknown.
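
The following sketch evaluates these quantities numerically. It assumes the forms $a_n = \sqrt{2\log(2m)/n} + \log(2m) K_m/n$, $\bar a_n = 4a_n$, and the map $a \mapsto a(1 + t\sqrt{2(1 + 2aK_m)} + 2t^2 a K_m/3)$ for (2) and (3). The sample sizes, the value of $K_m$, and the bisection helper `solve_t` are hypothetical and only illustrate how a fixed value of the bound corresponds to a value of $t$.

```python
import math

def a_n(n, m, K_m):
    """a_n = sqrt(2 log(2m)/n) + (log(2m)/n) * K_m."""
    return math.sqrt(2.0 * math.log(2 * m) / n) + math.log(2 * m) / n * K_m

def lam0(a, K_m, t):
    """a * (1 + t*sqrt(2(1 + 2 a K_m)) + 2 t^2 a K_m / 3)."""
    return a * (1.0 + t * math.sqrt(2.0 * (1.0 + 2.0 * a * K_m))
                + 2.0 * t * t * a * K_m / 3.0)

def solve_t(target, a, K_m, tol=1e-12):
    """Bisection for t >= 0 with lam0(a, K_m, t) = target; needs target >= a."""
    if target < a:
        raise ValueError("target must be at least a")
    lo, hi = 0.0, 1.0
    while lam0(a, K_m, hi) < target:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lam0(a, K_m, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical sizes: n = 1000 observations, m = 5000 base functions, K_m = 2.
n, m, K_m = 1000, 5000, 2.0
a = a_n(n, m, K_m)
a_bar = 4.0 * a
lam_bar = lam0(a_bar, K_m, t=1.0)
t_back = solve_t(lam_bar, a_bar, K_m)
```

Since the map is strictly increasing in $t$, every value above its value at $t = 0$ corresponds to exactly one $t > 0$, which is what bisection recovers here.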
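Let

$$I(\theta) := \sum_{k=1}^m \sigma_k |\theta_k|.$$

We call $I(\theta)$ the (theoretical) $\ell_1$ norm of $\theta$, and $\hat I(\theta) = \sum_{k=1}^m \hat\sigma_k |\theta_k|$ its
empirical $\ell_1$ norm. Moreover, for any $\theta$ and $\tilde\theta$ in $\Theta$, we let

$$I_1(\theta|\tilde\theta) := \sum_{k : \tilde\theta_k \ne 0} \sigma_k |\theta_k|, \qquad I_2(\theta|\tilde\theta) := I(\theta) - I_1(\theta|\tilde\theta).$$

Likewise for the empirical versions:

$$\hat I_1(\theta|\tilde\theta) := \sum_{k : \tilde\theta_k \ne 0} \hat\sigma_k |\theta_k|, \qquad \hat I_2(\theta|\tilde\theta) := \hat I(\theta) - \hat I_1(\theta|\tilde\theta).$$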

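2. Main results.

2.1. Nonrandom normalization weights in the penalty. Suppose that the
$\sigma_k$ are known. Consider the estimator

(4)    $\hat\theta_n := \arg\min_{\theta \in \Theta} \{ P_n \gamma_{f_\theta} + \lambda_n I(\theta) \}.$

We now define the following six quantities:

(1) $\lambda_n := 2\bar\lambda_{n,0}$,

(2) $\mathcal{V}_\theta := H(4\lambda_n \sqrt{D_\theta})$,

(3) $\theta_n^* := \arg\min_{\theta \in \Theta} \{ \mathcal{E}(f_\theta) + \mathcal{V}_\theta \}$,

(4) $2\epsilon_n^* := 3\mathcal{E}(f_{\theta_n^*}) + 2\mathcal{V}_{\theta_n^*}$,

(5) $\zeta_n^* := \epsilon_n^* / \bar\lambda_{n,0}$,

(6) $\theta(\epsilon_n^*) := \arg\min_{\theta \in \Theta,\ I(\theta - \theta_n^*) \le 6\zeta_n^*} \{ \mathcal{E}(f_\theta) - 4\lambda_n I_1(\theta - \theta_n^* | \theta_n^*) \}$.

These are defined "locally," that is, only for this subsection. This is
because we have set some arbitrary values for certain constants involved. In
Section 2.2, the six quantities will appear as well, but now with other
constants. Moreover, in the Appendix, we define them explicitly as functions
of the constants.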



Condition I. It holds that $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta$, where $\eta$ is given in
Assumption B.

Condition II. It holds that $\|f_{\theta(\epsilon_n^*)} - \bar f\|_\infty \le \eta$, where $\eta$ is given in
Assumption B.

Theorem 2.1. Suppose Assumptions L, A, B and C, and Conditions I
and II hold. Let $\lambda_n$, $\theta_n^*$, $\epsilon_n^*$ and $\zeta_n^*$ be given in (1)-(6). Assume $\sigma_k$ is known
for all $k$ and let $\hat\theta_n$ be given in (4). Then we have with probability at least

$$1 - 7\exp[-n \bar a_n^2 t^2],$$

that

(5)    $\mathcal{E}(f_{\hat\theta_n}) \le 2\epsilon_n^*,$

and moreover

(6)    $2 I(\hat\theta_n - \theta_n^*) \le 7\zeta_n^*.$

We now first discuss the quantities (1)-(6) and then the message of the
theorem. To appreciate the meaning of the result, it may help to consider the
"typical" case, with $G$ (given in Assumption B) quadratic [say $G(u) = u^2/2$],
so that $H$ is also quadratic [say $H(v) = v^2/2$], and with $D(\mathcal{K})$ (given in
Assumption C) being (up to constants) the cardinality $|\mathcal{K}|$ of the index
set $\mathcal{K}$ (see Section 3.1 for a discussion).

First, recall that $\lambda_n$ in (1) is the smoothing parameter. Note thus that $\lambda_n$
is chosen to be at least $(2\times)(4\times)\sqrt{2\log(2m)/n}$.

The function $\mathcal{V}_\theta$ in (2) depends only on the set of nonzero coefficients
in $\theta$, and we refer to it as "estimation error" (in a generic sense). Note
that in the "typical" case, $\mathcal{V}_\theta$ is up to constants equal to $\lambda_n^2 \dim_\theta$, where
$\dim_\theta := |\{k : \theta_k \ne 0\}|$, that is, it is of order $\log m \times \dim_\theta / n$.

Because $\theta_n^*$ in (3) balances approximation error $\mathcal{E}(f_\theta)$ and "estimation
error" $\mathcal{V}_\theta$, we refer to it as the "oracle" (although in different places, oracles
may differ due to various choices of certain constants).

The terminology "estimation error" and "oracle" is inspired by results
for the "typical" case, because for estimating a given number, $\dim_*$ say, of
coefficients, the estimation error is typically of order $\dim_*/n$. The additional
factor $\log m$ is the price for not knowing which coefficients are relevant.

Because we only show that $\mathcal{V}_\theta$ is an upper bound for estimation error, our
terminology is not really justified in general. Moreover, our "oracle" is not
allowed to try a different loss function with perhaps better margin behavior.
To summarize, our terminology is mainly chosen for ease of reference.

The quantity $\epsilon_n^*$ in (4) will be called the "oracle rate" (for the excess
risk). Moreover, $\zeta_n^*$ in (5) will be called the "oracle rate" for the $\ell_1$ norm.
The latter can be compared with the result for Gaussian regression with
orthogonal (and fixed) design (where $m \le n$). Then the soft-thresholding
estimator converges in $\ell_1$ norm with rate $\dim_* \sqrt{\log n / n}$, with $\dim_*$ up to
constants the number of coefficients larger than $\sqrt{\log n / n}$ of the target. In
the "typical" situation, $\zeta_n^*$ is of corresponding form.
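
The trade-off behind the oracle can be illustrated in the "typical" quadratic case. The sketch below assumes $\lambda_n = 2\bar\lambda_{n,0}$, $H(v) = v^2/2$, $\mathcal{V}_\theta = H(4\lambda_n\sqrt{D_\theta})$, $2\epsilon_n^* = 3\mathcal{E}(f_{\theta_n^*}) + 2\mathcal{V}_{\theta_n^*}$ and $\zeta_n^* = \epsilon_n^*/\bar\lambda_{n,0}$. The candidate list of (excess risk, $D_\theta$) pairs and the value of $\bar\lambda_{n,0}$ are invented for illustration; in practice the excess risks are unknown, which is the point of an oracle.

```python
import math

def oracle_quantities(candidates, lam_bar):
    """candidates: list of (excess_risk, D_theta) pairs, one per candidate theta.

    Assumes lambda_n = 2*lam_bar and H(v) = v^2/2, so that
    V_theta = H(4*lambda_n*sqrt(D_theta)) = 8 * lambda_n^2 * D_theta.
    Returns (index of the oracle, epsilon_star, zeta_star).
    """
    lam_n = 2.0 * lam_bar

    def V(D):
        return 0.5 * (4.0 * lam_n * math.sqrt(D)) ** 2

    best = min(range(len(candidates)),
               key=lambda j: candidates[j][0] + V(candidates[j][1]))
    excess, D = candidates[best]
    eps_star = 0.5 * (3.0 * excess + 2.0 * V(D))
    zeta_star = eps_star / lam_bar
    return best, eps_star, zeta_star

# Hypothetical trade-off: richer models approximate better but cost more.
candidates = [(0.50, 1), (0.10, 3), (0.09, 10), (0.0, 200)]
best, eps_star, zeta_star = oracle_quantities(candidates, lam_bar=0.05)
```

With these made-up numbers the second candidate wins: moving from 3 to 10 nonzero coefficients reduces the approximation error by only 0.01 while raising the estimation error from 0.24 to 0.80.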

The last quantity (6), the vector $\theta(\epsilon_n^*)$, balances excess risk with "being
different" from $\theta_n^*$. Note that the balance is done locally, near values of $\theta$
where the $\ell_1$ distance between $\theta$ and $\theta_n^*$ is at most $6\zeta_n^*$.

Conditions I and II are technical conditions, which we need because
Assumption B is only assumed to hold in a "neighborhood" of the target.
Condition II may follow from a fast enough rate of convergence of the
oracle.

The theorem states that the estimator with lasso penalty has excess risk
at most twice $\epsilon_n^*$. It achieves this without knowledge about the estimation
error $\mathcal{V}_\theta$ or, in particular, about the conjugate $H(\cdot)$ or the function $D(\cdot)$. In
this sense, the estimator mimics the oracle.

The smoothing parameter $\lambda_n$, being larger than $\sqrt{2\log(2m)/n}$, plays its
role in the estimation error $\mathcal{V}_{\theta_n^*}$. So from an asymptotic point of view, we have
the usual requirement that $m$ does not grow exponentially in $n$. In fact, in
order for the estimation error to vanish, we require $D_{\theta_n^*} \log m / n \to 0$.

We observe that the constants involved are explicit, and reasonable for
finite sample sizes. In the Appendix, we formulate the results with more
general constants. For example, it is allowed there that the smoothing
parameter $\lambda_n$ is not necessarily twice $\bar\lambda_{n,0}$, but may be arbitrarily close to
$\bar\lambda_{n,0}$, with consequences on the oracle rate. Our specific choice of the
constants is merely to present the result in a transparent way.

The constant $a_n$ is, roughly speaking (when $K_m$ is not very large), the
usual threshold that occurs when considering the Gaussian linear model
with orthogonal design (and $m \le n$). The factor "4" appearing in front of
it in our definition of $\bar a_n$ is due to the fact that we use a symmetrization
inequality (which accounts for a factor "2") and a contraction inequality
(which accounts for another factor "2"). We refer to Lemma A.2 for the
details.

In our proof, we make use of Bousquet's inequality [Bousquet (2002)] for 
the amount of concentration of the empirical process around its mean. This 
inequality has the great achievement that the constants involved are econom- 
ical. There is certainly room for improvement in the constants we, in our 
turn, derived from applying Bousquet's inequality. Our result should, there- 
fore, only be seen as indication that the theory has something to say about 
finite sample sizes. In practice, cross validation for choosing the smoothing 
parameter seems to be the most reasonable way to go. 

Finally, Theorem 2.1 shows that the $\ell_1$ difference between the estimated
coefficients $\hat\theta_n$ and the oracle $\theta_n^*$ can be small. In this sense the lasso performs
feature selection when the oracle rate is sufficiently fast. When $R(\bar f)$ is not
small, the convergence of $I(\hat\theta_n - \theta_n^*)$ is perhaps more important than fast
rates for the excess risk.

2.2. Random normalization weights in the penalty. In this subsection,
we estimate $\sigma_k^2 := Q\psi_k^2$ by $\hat\sigma_k^2 := Q_n\psi_k^2$, $k = 1, \ldots, m$. A complication is that
the bound $K_m$ will generally also be unknown. The smoothing parameter
$\lambda_n$ depends on $K_m$, as well as on $t$. We will assume a known but rough
bound on $K_m$ (see Condition III' below). Then, we choose a known (large
enough) value for $\bar\lambda_{n,0}$, which corresponds to an unknown value of $t$. This is
elaborated upon in Theorem 2.2.

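Recall the estimator

(7)    $\hat\theta_n = \arg\min_{\theta \in \Theta} \{ P_n \gamma_{f_\theta} + \lambda_n \hat I(\theta) \}.$

In the new setup, we define the six quantities as:

(1)' $\lambda_n := 3\bar\lambda_{n,0}$,

(2)' $\mathcal{V}_\theta := H(5\lambda_n \sqrt{D_\theta})$,

(3)' $\theta_n^* := \arg\min_{\theta \in \Theta} \{ \mathcal{E}(f_\theta) + \mathcal{V}_\theta \}$,

(4)' $2\epsilon_n^* := 3\mathcal{E}(f_{\theta_n^*}) + 2\mathcal{V}_{\theta_n^*}$,

(5)' $\zeta_n^* := \epsilon_n^* / \bar\lambda_{n,0}$,

(6)' $\theta(\epsilon_n^*) := \arg\min_{\theta \in \Theta,\ I(\theta - \theta_n^*) \le 6\zeta_n^*} \{ \mathcal{E}(f_\theta) - 5\lambda_n I_1(\theta - \theta_n^* | \theta_n^*) \}$.

Note that (1)'-(6)' are just (1)-(6) with slightly different constants.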

Condition I'. It holds that $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta$, with $\eta$ given in
Assumption B.

Condition II'. It holds that $\|f_{\theta(\epsilon_n^*)} - \bar f\|_\infty \le \eta$, with $\eta$ given in
Assumption B.

Condition III'. We have $\sqrt{\frac{\log(2m)}{n}}\, K_m \le 0.13$.

The constant 0.13 in Condition III' was again set quite arbitrarily, in the
sense that, provided some other constants are adjusted properly, it may be
replaced by any other constant smaller than $(\sqrt{6} - \sqrt{2})/2$. The latter comes
from our calculations using Bousquet's inequality.

Theorem 2.2. Suppose Assumptions L, A, B and C, and Conditions
I', II' and III' are met. Let $\lambda_n$, $\theta_n^*$, $\epsilon_n^*$ and $\zeta_n^*$ be given in (1)'-(6)' and let
$\hat\theta_n$ be given in (7). Take

$$\bar\lambda_{n,0} > 4\sqrt{\frac{\log(2m)}{n}} \times (1.6).$$

Then with probability at least $1 - \alpha$, we have that

(8)    $\mathcal{E}(f_{\hat\theta_n}) \le 2\epsilon_n^*,$

and moreover

(9)    $2 I(\hat\theta_n - \theta_n^*) \le 7\zeta_n^*.$

Here

(10)    $\alpha = \exp[-n a_n^2 s^2] + 7\exp[-n \bar a_n^2 t^2],$

with $s > 0$ being defined by $\tfrac{5}{9} = K_m \lambda_{n,0}(s)$, and $t > 0$ being defined by
$\bar\lambda_{n,0} = \bar\lambda_{n,0}(t)$.

The definition of $\lambda_{n,0}(s)$ [$\bar\lambda_{n,0}(t)$] was given in (2) [(3)]. Thus, Theorem
2.2 gives qualitatively the same conclusion as Theorem 2.1.

To estimate the "confidence level" $\alpha$, we need an estimate of $K_m$. In
Corollary A.3, we present an estimated upper bound for $\alpha$. An estimate of
a lower bound can be found similarly.

3. Discussion of the assumptions and some examples.

3.1. Discussion of Assumptions L, A, B and C. We assume in Assumption L
that the loss function $\gamma$ is Lipschitz in $f$. This corresponds to the
"robust case" with bounded influence function (e.g., Huber or quantile loss
functions). The Lipschitz condition allows us to apply the contraction
inequality (see Theorem A.3). It may well be that it depends on whether or
not $\mathcal{F}$ is a class of functions uniformly bounded in sup norm. Furthermore,
from the proofs one may conclude that we only need the Lipschitz condition
locally near the target $\bar f$ [i.e., for those $\theta$ with $I(\theta - \theta_n^*) \le 6\zeta_n^*$]. This will be
exploited in Example 4 where the least squares loss is considered.

Assumption A is a technical but important condition. In fact, Condition
III' requires a bound proportional to $\sqrt{n/\log(2m)}$ for the constant $K_m$ of
Assumption A. When for instance $\psi_1, \ldots, \psi_m$ actually form the co-variable
$X$ itself [i.e., $X = (\psi_1(X), \ldots, \psi_m(X)) \in \mathbb{R}^m$], one possibly has that $K_m \le K$
where $K$ does not grow with $m$. When the $\psi_k$ are truly feature mappings
(e.g., wavelets), Assumption A may put a restriction on the number $m$ of
base functions $\psi_k$.

Inspired by Tsybakov (2004), we call Assumption B the margin assumption.
In most situations, $G$ is quadratic, that is, for some constant $C_0$,
$G(u) = u^2/(2C_0)$, $u > 0$. This happens, for example, when $\mathbf{F}$ is the set
of all (measurable) functions, and in addition $\gamma_f(X, Y) = \gamma(f(X), Y)$ and
$E(\gamma(z, Y)|X)$ is twice differentiable in a neighborhood of $z = \bar f(X)$, with
second derivative at least $1/C_0$, $Q$-a.s.

Next, we address Assumption C. Define the vector $\psi = (\psi_1, \ldots, \psi_m)^T$.
Suppose the $m \times m$ matrix

$$\Sigma := \int \psi \psi^T \, dQ$$

has smallest eigenvalue $\beta^2 > 0$. Then one can easily verify that Assumption
C holds with

(11)    $D(\mathcal{K}) = |\mathcal{K}| / \beta^2,$

where $|\mathcal{K}|$ is the cardinality of the set $\mathcal{K}$. Thus, then $D$ indicates
"dimension." Weighted versions are also relatively straightforward. Let $A(\mathcal{K})$ be the
selection matrix:

$$\sum_{k \in \mathcal{K}} \alpha_k^2 = \alpha^T A(\mathcal{K}) \alpha.$$

Then we can take

$$D(\mathcal{K}) = \sum_{k \in \mathcal{K}} w_k^{-2} / \beta^2(\mathcal{K})$$

with $w_1, \ldots, w_m$ being a set of positive weights, and $1/\beta^2(\mathcal{K})$ the largest
eigenvalue of the matrix $\Sigma^{-1/2} W A(\mathcal{K}) W \Sigma^{-1/2}$, $W := \mathrm{diag}(w_1, \ldots, w_m)$ [see
also Tarigan and van de Geer (2006)].

A refinement of Assumption C, for example the cumulative local coherence
assumption [see Bunea, Tsybakov and Wegkamp (2007a)], is needed to
handle the case of overcomplete systems. We propose the refinement

Assumption C*. There exist nonnegative functions $\rho(\cdot)$ and $D(\cdot)$ on
the subsets of the index set $\{1, \ldots, m\}$, such that for all $\mathcal{K} \subset \{1, \ldots, m\}$, and
for all $\theta \in \Theta$ and $\tilde\theta \in \Theta$, we have

$$\sum_{k \in \mathcal{K}} \sigma_k |\theta_k - \tilde\theta_k| \le \rho(\mathcal{K}) I(\theta - \tilde\theta)/2 + \sqrt{D(\mathcal{K})}\, \|f_\theta - f_{\tilde\theta}\|.$$

With Assumption C*, one can prove versions of Theorems 2.1 and 2.2
with different constants, provided that the "oracle" $\theta_n^*$ is restricted to
having a value $\rho(\{k : \theta_{n,k}^* \ne 0\})$ strictly less than one. It can be
shown that this is true under the cumulative local coherence assumption
in Bunea, Tsybakov and Wegkamp (2007a), with $D(\mathcal{K})$ again proportional
to $|\mathcal{K}|$. The reason why it works is that the additional term in Assumption
C* (as compared to Assumption C) is killed by the penalty. Thus, the results
can be extended to situations where $\Sigma$ is singular. However, due to lack of
space we will not present a full account of this extension.

Assumption A is intertwined with Assumption C. As an illustration,
suppose that one wants to include dummy variables for exclusive groups in the
system $\{\psi_k\}$. Then Assumptions A and C together lead to requiring that
the number of observations per group is not much smaller than $n/K_m^2$.

3.2. Modifications. In many cases, it is natural to leave a given subset
of the coefficients not penalized, for example, those corresponding to the
constant term and perhaps some linear terms, or to co-variables that are
considered as definitely relevant. Suppose there are $p < m$ such coefficients,
say the first $p$. The penalty is then modified to

$$\hat I(\theta) = \sum_{k=p+1}^m \hat\sigma_k |\theta_k|.$$

With this modification, the arguments used in the proof need only slight
adjustments.

An important special case is where $\{\psi_k\}$ contains the constant function
$\psi_1 \equiv 1$, and $\theta_1$ is not penalized. In that case, it is natural to modify
Assumption A to

$$K_m := \max_{2 \le k \le m} \frac{\|\psi_k - \mu_k\|_\infty}{\sigma_k} < \infty,$$

where $\mu_k = Q\psi_k$ and where $\sigma_k^2$ is now defined as $\sigma_k^2 = Q\psi_k^2 - \mu_k^2$, $k = 2, \ldots, m$.
Moreover, the penalty may be modified to $\hat I(\theta) = \sum_{k=2}^m \hat\sigma_k |\theta_k|$, where now
$\hat\sigma_k^2 = Q_n\psi_k^2 - \hat\mu_k^2$ and $\hat\mu_k = Q_n\psi_k$. Thus, the means $\mu_k$ are now also estimated.
However, this additional source of randomness is in a sense of smaller order.
In conclusion, this modification does not bring in new theoretical
complications, but does have a slight influence on the constants.

3.3. Some examples. 

Example 1 (Logistic regression). Consider the case $Y \in \{0, 1\}$ and the
logistic loss function

$$\gamma_f(x, y) = \gamma(f(x), y) := [-f(x) y + \log(1 + e^{f(x)})]/2.$$

It is clear that this loss function is convex and Lipschitz. Let the target
$\bar f$ be the log-odds ratio, that is, $\bar f = \log\big(\frac{\pi}{1 - \pi}\big)$, where $\pi(x) := E(Y|X = x)$.
It is easy to see that when for some $\epsilon > 0$, it holds that $\epsilon \le \pi \le 1 - \epsilon$,
$Q$-a.e., then Assumption B is met with $G(u) = u^2/(2C_0)$, $u > 0$, and with
constant $C_0$ depending on $\eta$ and $\epsilon$. The conjugate $H$ is then also quadratic,
say $H(v) = C_1 v^2$, $v > 0$, where $C_1$ easily can be derived from $C_0$. Thus, for
example, Theorem 2.2 has estimation error $\mathcal{V}_\theta = 25 C_1 \lambda_n^2 D_\theta$.
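
A short numerical sketch of this example follows. It implements the loss $[-fy + \log(1 + e^f)]/2$ pointwise and the log-odds target; the helper names are hypothetical, and the stable evaluation of $\log(1 + e^f)$ is an implementation choice rather than part of the theory.

```python
import math

def logistic_loss(f, y):
    """gamma(f, y) = (-f*y + log(1 + exp(f))) / 2 for y in {0, 1}."""
    # log(1 + exp(f)) evaluated without overflow for large |f|.
    softplus = max(f, 0.0) + math.log1p(math.exp(-abs(f)))
    return (-f * y + softplus) / 2.0

def log_odds(pi):
    """Target value log(pi / (1 - pi)) at a point where P(Y = 1 | x) = pi."""
    return math.log(pi / (1.0 - pi))

def expected_loss(f, pi):
    """E[gamma(f, Y)] when Y is Bernoulli(pi)."""
    return pi * logistic_loss(f, 1) + (1.0 - pi) * logistic_loss(f, 0)
```

The derivative of the loss in $f$ is $(-y + e^f/(1 + e^f))/2$, which lies in $[-1/2, 1/2]$, so the Lipschitz property of Assumption L holds with room to spare, and the expected loss is minimized pointwise at the log-odds.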

Example 2 (Density estimation). Let $X$ have density $q_0 := dQ/d\nu$ with
respect to a given $\sigma$-finite dominating measure $\nu$. We estimate this density
by

$$\hat q_n = \exp[f_{\hat\theta_n} - b(f_{\hat\theta_n})],$$

where $\hat\theta_n$ is the lasso-penalized estimator with loss function

$$\gamma_f(X) := -f(X) + b(f), \qquad f \in \mathbf{F}.$$

Here $\mathbf{F} = \{ f : \int e^f \, d\nu < \infty \}$ and $b(f) = \log \int e^f \, d\nu$ is the normalization
constant. There is no response variable $Y$ in this case.
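
For intuition, the normalization can be sketched on a finite space, where the integral against $\nu$ becomes a weighted sum. The finite setting, the function names and the numbers below are assumptions made purely for illustration.

```python
import math

def log_normalizer(f_values, nu_weights):
    """b(f) = log of the sum over x of exp(f(x)) * nu({x}) on a finite space."""
    top = max(f_values)  # subtract the maximum for numerical stability
    return top + math.log(sum(w * math.exp(f - top)
                              for f, w in zip(f_values, nu_weights)))

def density(f_values, nu_weights):
    """q = exp(f - b(f)), a density with respect to nu."""
    b = log_normalizer(f_values, nu_weights)
    return [math.exp(f - b) for f in f_values]

# Hypothetical example: three points, counting measure, f = (0, 1, 2).
nu = [1.0, 1.0, 1.0]
q = density([0.0, 1.0, 2.0], nu)
total_mass = sum(qx * w for qx, w in zip(q, nu))
```

Adding a constant to $f$ changes $b(f)$ by the same constant and leaves $q$ unchanged, so a constant term in $f$ cannot be identified from the density.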

Clearly $\gamma_f$ satisfies Assumption L. Now, let $f_\theta = \sum_{k=1}^m \theta_k \psi_k$, with $\theta \in \Theta$
and $\Theta = \{\theta \in \mathbb{R}^m : f_\theta \in \mathbf{F}\}$. If we assume, for some $\epsilon > 0$, that $\epsilon \le q_0 \le 1/\epsilon$,
$\nu$-a.e., one easily verifies that Assumption B holds for $G(u) = u^2/(2C_0)$,
$u > 0$, with $C_0$ depending on $\epsilon$ and $\eta$. So we arrive at a similar estimation
error as in the logistic regression example.

Now, a constant term will not be identifiable in this case, so we do not
put the constant function in the system $\{\psi_k\}$. We may moreover take the
functions $\psi_k$ centered in a convenient way, say $\int \psi_k \, d\nu = 0$, $\forall k$. In that
situation a natural penalty with nonrandom weights could be $\sum_{k=1}^m w_k |\theta_k|$, where
$w_k^2 = \int \psi_k^2 \, d\nu$, $k = 1, \ldots, m$. However, with this penalty, we are only able to
prove a result along the lines of Theorem 2.1 or 2.2 when the smoothing
parameter $\lambda_n$ is chosen depending on an estimated lower bound for the
density $q_0$. The penalty $\hat I(\theta)$ with random weights $\hat\sigma_k = (Q_n\psi_k^2)^{1/2}$, $k = 1, \ldots, m$,
allows a choice of $\lambda_n$ which does not depend on such an estimate.

Note that in this example, $\gamma(f(X), Y) = -f(X)$ is linear in $f$. Thus, in
Lemma A.2, we do not need a contraction or symmetrization inequality,
and may replace $\bar a_n$ ($\bar\lambda_{n,0}$) by $a_n$ ($\lambda_{n,0}$) throughout, that is, we may cancel
a factor 4.

Example 3 (Hinge loss). Let $Y \in \{\pm 1\}$. The hinge loss function, used
in binary classification, is

$$\gamma_f(X, Y) := (1 - Y f(X))_+.$$

Clearly, hinge loss satisfies Assumption L. Let $\pi(x) := P(Y = 1|X = x)$ and
$\mathbf{F}$ be all (measurable) functions. Then the target $\bar f$ is Bayes decision rule,
that is, $\bar f = \mathrm{sign}(2\pi - 1)$. Tarigan and van de Geer (2006) show that instead
of Assumption B, it in fact holds, under reasonable conditions, that

$$\mathcal{E}(f_\theta) \ge G(\|f_\theta - \bar f\|_1^{1/2}),$$

with $G$ being again a strictly convex increasing function, but with $\|\cdot\|_1^{1/2}$
being the square root $L_1(Q)$ norm instead of the $L_2(Q)$ norm. The different
norm however does not require essentially new theory. The bound
$\|f - \bar f\| \le \sqrt{\eta}\, \|f - \bar f\|_1^{1/2}$ (which holds if $\|f - \bar f\|_\infty \le \eta$) now ties Assumption B to
Assumption C. See Tarigan and van de Geer (2006) for the details.

Alternatively, we may aim at a different target $\bar f$. Consider a minimizer
$f_{\bar\theta}$ over $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ of expected hinge loss $E(1 - Y f(X))_+$. Suppose
that $\mathrm{sign}(f_{\bar\theta})$ is Bayes decision rule $\mathrm{sign}(2\pi - 1)$. Then $f_{\bar\theta}$ is a good target
for classification. But for $\bar f = f_{\bar\theta}$, the margin behavior (Assumption B) is as
yet not well understood.
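
The claim that Bayes decision rule minimizes expected hinge loss over all functions can be checked pointwise. The sketch below assumes a single point $x$ with $P(Y = 1|X = x) = \pi$; the helper names and the convention at $\pi = 1/2$ are choices made for this illustration.

```python
def hinge_loss(f, y):
    """gamma(f, y) = (1 - y*f)_+ for y in {-1, +1}."""
    return max(0.0, 1.0 - y * f)

def expected_hinge(f, pi):
    """E[(1 - Y f)_+] at a point where P(Y = 1 | x) = pi."""
    return pi * hinge_loss(f, 1) + (1.0 - pi) * hinge_loss(f, -1)

def bayes_rule(pi):
    """sign(2*pi - 1), taking the value +1 at pi = 1/2 by convention."""
    return 1.0 if 2.0 * pi - 1.0 >= 0.0 else -1.0
```

On $[-1, 1]$ the expected hinge loss equals $1 - (2\pi - 1) f$, which is linear in $f$, so the minimum sits at the endpoint $\mathrm{sign}(2\pi - 1)$ and equals $2\min(\pi, 1 - \pi)$.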

Example 4 (Quadratic loss). Suppose $Y = \bar f(X) + \varepsilon$, where $\varepsilon$ is $\mathcal{N}(0, 1)$
distributed and independent of $X$. Let $\gamma_f$ be quadratic loss

$$\gamma_f(X, Y) := \tfrac{1}{2}(Y - f(X))^2.$$

It is clear that in this case, Assumption B holds for all $f$, with $G(u) = u^2/2$,
$u > 0$. However, the quadratic loss function is not Lipschitz on the whole
real line. The Lipschitz property was used in order to apply the contraction
inequality to the empirical process [see (15) of Lemma A.2 in Section 4.2]. To
handle quadratic loss and Gaussian errors, we may apply exponential bounds
for Gaussian random variables and a "local" Lipschitz property. Otherwise,
one can use similar arguments as in the Lipschitz case. This leads to the
following result for the case of known $\{\sigma_k\}$. (The case of unknown $\{\sigma_k\}$ can be
treated similarly.)

Theorem 3.1. Suppose Assumptions A and C hold. Let $\lambda_n$, $\theta_n^*$, $\epsilon_n^*$ and
$\zeta_n^*$ be given in (1)-(6), with $H(v) = v^2/2$, $v > 0$, but now with $\bar\lambda_{n,0}$ replaced
by

$$\tilde\lambda_{n,0} := \sqrt{\frac{14}{9}} \sqrt{\frac{2\log(2m)}{n} + 2 t^2 \bar a_n^2} + \bar\lambda_{n,0}.$$

Assume moreover that $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta \le 1/2$, $6\zeta_n^* K_m + 2\eta \le 1$, and that
$\sqrt{\frac{\log(2m)}{n}}\, K_m \le 0.33$. Let $\sigma_k$ be known for all $k$ and let $\hat\theta_n$ be given in (4).
Then we have with probability at least $1 - \alpha$, that

(12)    $\mathcal{E}(f_{\hat\theta_n}) \le 2\epsilon_n^*,$

and moreover

(13)    $2 I(\hat\theta_n - \theta_n^*) \le 7\zeta_n^*.$

Here

$$\alpha = \exp[-n a_n^2 s^2] + 7\exp[-n \bar a_n^2 t^2],$$

with $s > 0$ a solution of $\tfrac{5}{9} = K_m \lambda_{n,0}(s)$.

We conclude that when the oracle rate is small enough, the theory
essentially goes through. For example, when $K_m = O(1)$ we require that the
oracle has not more than $O(\sqrt{n/\log n})$ nonzero coefficients. This is in line
with the results in Bunea, Tsybakov and Wegkamp (2007b). We also refer to
Bunea, Tsybakov and Wegkamp (2007b) for possible extensions, including
non-Gaussian errors and overcomplete systems.



APPENDIX 

The proofs of the results in this paper have elements that have become
standard in the literature on penalized M-estimation. We refer to Massart
(2000b) for a rather complete account of these. The $\ell_1$ penalty however
has as special feature that it allows one to avoid estimating explicitly the
estimation error. Moreover, a new element in the proof is the way we use the
convexity of the penalized loss function to enter directly into local conditions.

The organization of the proofs is as follows. We start out in the next
subsection with general results for empirical processes. This is applied in
Section A.2 to obtain, for all $M > 0$, a bound for the empirical process
uniformly over $\{\theta \in \Theta : I(\theta - \theta^*) \le M\}$. Here, $\theta^*$ is some fixed value, which
can be chosen conveniently, according to the situation. We will choose $\theta^*$ to
be the oracle $\theta_n^*$. Once we have this, we can start the iteration process to
obtain a bound for $I(\hat\theta_n - \theta_n^*)$. Given this bound, we then proceed proving a
bound for the excess risk $\mathcal{E}(f_{\hat\theta_n})$. Section A.3 does this when the $\sigma_k$ are known,
and Section A.4 goes through the adjustments when the $\sigma_k$ are estimated.
Section A.5 considers quadratic loss.

Throughout Sections A.1-A.4, we require that Assumptions L, A, B and
C hold.

A.1. Preliminaries. Our main tool is a result from Bousquet (2002),
which is an improvement of the constants in Massart (2000a), the latter
being again an improvement of the constants in Ledoux (1996). The result
says that the supremum of any empirical process is concentrated near its
mean. The amount of concentration depends only on the maximal sup norm
and the maximal variance.

Theorem A.1 (Concentration theorem [Bousquet (2002)]). Let $Z_1, \ldots, Z_n$
be independent random variables with values in some space $\mathcal{Z}$ and let $\Gamma$ be
a class of real-valued functions on $\mathcal{Z}$, satisfying for some positive constants
$\eta_n$ and $\tau_n$

$$\|\gamma\|_\infty \le \eta_n \quad \forall\, \gamma \in \Gamma$$

and

$$\frac{1}{n} \sum_{i=1}^n \mathrm{var}(\gamma(Z_i)) \le \tau_n^2 \quad \forall\, \gamma \in \Gamma.$$

Define

$$\mathbf{Z} := \sup_{\gamma \in \Gamma} \Big| \frac{1}{n} \sum_{i=1}^n \big( \gamma(Z_i) - E\gamma(Z_i) \big) \Big|.$$

Then for $z > 0$,

$$P\Big( \mathbf{Z} \ge E\mathbf{Z} + z\sqrt{2(\tau_n^2 + 2\eta_n E\mathbf{Z})} + \frac{2 z^2 \eta_n}{3} \Big) \le \exp[-n z^2].$$



Bousquet's inequality involves the expectation of the supremum of the
empirical process. This expectation can be a complicated object, but one
may derive bounds for it using symmetrization and, in our Lipschitz case,
contraction. To state these techniques, we need to introduce i.i.d. random
variables $\varepsilon_1, \ldots, \varepsilon_n$, taking values $\pm 1$ each with probability 1/2. Such a
sequence is called a Rademacher sequence.

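Theorem A.2 (Symmetrization theorem [van der Vaart and Wellner
(1996)]). Let $Z_1,\dots,Z_n$ be independent random variables with values in
$\mathcal{Z}$, and let $\varepsilon_1,\dots,\varepsilon_n$ be a Rademacher sequence independent of $Z_1,\dots,Z_n$.
Let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$. Then
\[
E\left( \sup_{\gamma\in\Gamma} \left| \frac{1}{n}\sum_{i=1}^n \bigl\{\gamma(Z_i) - E\gamma(Z_i)\bigr\} \right| \right)
\le 2E\left( \sup_{\gamma\in\Gamma} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\gamma(Z_i) \right| \right).
\]

Theorem A.3 (Contraction theorem [Ledoux and Talagrand (1991)]).
Let $z_1,\dots,z_n$ be nonrandom elements of some space $\mathcal{Z}$ and let $\mathcal{F}$ be a class
of real-valued functions on $\mathcal{Z}$. Consider Lipschitz functions $\gamma_i : \mathbb{R} \to \mathbb{R}$, that
is,
\[
|\gamma_i(s) - \gamma_i(\tilde s)| \le |s - \tilde s| \quad \forall\, s, \tilde s \in \mathbb{R}.
\]
Let $\varepsilon_1,\dots,\varepsilon_n$ be a Rademacher sequence. Then for any function $f^* : \mathcal{Z} \to \mathbb{R}$,
we have
\[
E\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl\{\gamma_i(f(z_i)) - \gamma_i(f^*(z_i))\bigr\} \right| \right)
\le 2E\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(f(z_i) - f^*(z_i)\bigr) \right| \right).
\]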



Now, suppose $\Gamma$ is a finite set of functions. In that case, a bound for the
expectation of the supremum of the empirical process over all $\gamma \in \Gamma$ can be
derived from Bernstein's inequality.

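Lemma A.1. Let $Z_1,\dots,Z_n$ be independent $\mathcal{Z}$-valued random variables,
and $\gamma_1,\dots,\gamma_m$ be real-valued functions on $\mathcal{Z}$, satisfying for $k = 1,\dots,m$,
\[
E\gamma_k(Z_i) = 0 \ \ \forall\, i, \qquad \|\gamma_k\|_\infty \le \eta_n, \qquad \frac{1}{n}\sum_{i=1}^n E\gamma_k^2(Z_i) \le \tau_n^2.
\]
Then
\[
E\left( \max_{1\le k\le m} \left| \frac{1}{n}\sum_{i=1}^n \gamma_k(Z_i) \right| \right)
\le \sqrt{\frac{2\tau_n^2\log(2m)}{n}} + \frac{\eta_n\log(2m)}{n}.
\]

Proof. Write $\bar\gamma_k := \frac{1}{n}\sum_{i=1}^n \gamma_k(Z_i)$, $k = 1,\dots,m$. A classical intermediate
step of the proof of Bernstein's inequality [see, e.g., van de Geer (2000)]
tells us that for $n/\beta > \eta_n$, we have
\[
E\exp(\beta\bar\gamma_k) \le \exp\left( \frac{\beta^2\tau_n^2}{2(n - \beta\eta_n)} \right).
\]
The same is true if we replace $\bar\gamma_k$ by $-\bar\gamma_k$. Hence,
\[
E\left( \max_{1\le k\le m} |\bar\gamma_k| \right)
\le \frac{1}{\beta}\log\left( E\exp\left( \beta\max_{1\le k\le m} \pm\bar\gamma_k \right) \right)
\le \frac{\log(2m)}{\beta} + \frac{\beta\tau_n^2}{2(n - \beta\eta_n)}.
\]
Now, take
\[
\frac{n}{\beta} = \eta_n + \sqrt{\frac{n\tau_n^2}{2\log(2m)}}.
\]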



□ 
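The maximal inequality of Lemma A.1 can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the illustrative choice that the $\gamma_k(Z_i)$ are independent random signs, so that the mean-zero condition holds with $\eta_n = 1$ and $\tau_n^2 = 1$. The function names `lemma_a1_bound` and `simulated_expected_max` are hypothetical.

```python
import math
import random


def lemma_a1_bound(n, m, eta_n=1.0, tau_n_sq=1.0):
    """Right-hand side of Lemma A.1: sqrt(2 tau^2 log(2m)/n) + eta log(2m)/n."""
    log_term = math.log(2 * m)
    return math.sqrt(2 * tau_n_sq * log_term / n) + eta_n * log_term / n


def simulated_expected_max(n, m, reps=200, seed=0):
    """Monte Carlo estimate of E max_k |(1/n) sum_i gamma_k(Z_i)|.

    Assumption for illustration only: gamma_k(Z_i) are independent +/-1 signs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        worst = 0.0
        for _ in range(m):
            s = sum(rng.choice((-1, 1)) for _ in range(n))
            worst = max(worst, abs(s) / n)
        total += worst
    return total / reps


n, m = 100, 20
estimate = simulated_expected_max(n, m)
bound = lemma_a1_bound(n, m)
print(f"simulated E max = {estimate:.4f}, Lemma A.1 bound = {bound:.4f}")
```

In this toy setting the simulated expectation should sit below the bound, with the gap reflecting the slack in the Bernstein-type argument.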
A.2. First application to M-estimation with lasso penalty. We now turn
to our specific context. We let $\varepsilon_1,\dots,\varepsilon_n$ be a Rademacher sequence, independent
of the training set $(X_1,Y_1),\dots,(X_n,Y_n)$. Moreover, we fix some $\theta^* \in \Theta$
and let, for $M > 0$, $\mathcal{F}_M := \{f_\theta : \theta \in \Theta,\ I(\theta - \theta^*) \le M\}$ and
\[
Z(M) := \sup_{f_\theta \in \mathcal{F}_M} \bigl| (P_n - P)(\gamma_{f_\theta} - \gamma_{f_{\theta^*}}) \bigr|, \tag{14}
\]
where $\gamma_f(X,Y) = \gamma(f(X),Y) + b(f)$ now denotes the loss function.

Lemma A.2. We have
\[
EZ(M) \le 4M\,E\left( \max_{1\le k\le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i)/\sigma_k \right| \right).
\]



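Proof. By the Symmetrization Theorem,
\[
EZ(M) \le 2E\left( \sup_{f_\theta\in\mathcal{F}_M} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl\{\gamma(f_\theta(X_i),Y_i) - \gamma(f_{\theta^*}(X_i),Y_i)\bigr\} \right| \right).
\]
Now, let $(\mathbf{X},\mathbf{Y}) = \{(X_i,Y_i)\}_{i=1}^n$ denote the sample, and let $E_{(\mathbf{X},\mathbf{Y})}$ denote the
conditional expectation given $(\mathbf{X},\mathbf{Y})$. Then invoke the Lipschitz property of
the loss functions, and apply the Contraction Theorem, with $z_i := X_i$ and
$\gamma_i(\cdot) := \gamma(Y_i,\cdot)$, $i = 1,\dots,n$, to find
\[
E_{(\mathbf{X},\mathbf{Y})}\left( \sup_{f_\theta\in\mathcal{F}_M} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl\{\gamma(f_\theta(X_i),Y_i) - \gamma(f_{\theta^*}(X_i),Y_i)\bigr\} \right| \right)
\le 2E_{(\mathbf{X},\mathbf{Y})}\left( \sup_{f_\theta\in\mathcal{F}_M} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(f_\theta(X_i) - f_{\theta^*}(X_i)\bigr) \right| \right). \tag{15}
\]
But clearly,
\[
\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(f_\theta(X_i) - f_{\theta^*}(X_i)\bigr) \right|
\le \sum_{k=1}^m \sigma_k|\theta_k - \theta_k^*| \max_{1\le k\le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i)/\sigma_k \right|
= I(\theta - \theta^*) \max_{1\le k\le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i)/\sigma_k \right|.
\]
Since for $f_\theta \in \mathcal{F}_M$, we have $I(\theta - \theta^*) \le M$, the result follows. □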



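Our next task is to bound the quantity
\[
E\left( \max_{1\le k\le m} \frac{\bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i) \bigr|}{\sigma_k} \right).
\]
Recall definition (1):
\[
a_n = \sqrt{\frac{2\log(2m)}{n}} + \frac{\log(2m)}{n}K_m.
\]

Lemma A.3. We have
\[
E\left( \max_{1\le k\le m} \frac{\bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i) \bigr|}{\sigma_k} \right) \le a_n,
\]
as well as
\[
E\left( \max_{1\le k\le m} \frac{\bigl| \frac{1}{n}\sum_{i=1}^n (\psi_k(X_i) - E\psi_k(X_i)) \bigr|}{\sigma_k} \right) \le a_n.
\]

Proof. This follows from $\|\psi_k\|_\infty/\sigma_k \le K_m$ and $\operatorname{var}(\psi_k(X))/\sigma_k^2 \le 1$. So
we may apply Lemma A.1 with $\eta_n = K_m$ and $\tau_n^2 = 1$. □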

We now arrive at the result that fits our purposes. 

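Corollary A.1. For all $M > 0$ and all $\theta \in \Theta$ with $I(\theta - \theta^*) \le M$, it
holds that
\[
\|\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\|_\infty \le MK_m
\]
and
\[
P(\gamma_{f_\theta} - \gamma_{f_{\theta^*}})^2 \le M^2.
\]
Therefore, since by Lemmas A.2 and A.3, for all $M > 0$,
\[
\frac{EZ(M)}{M} \le \bar a_n, \qquad \bar a_n = 4a_n,
\]
we have, in view of Bousquet's Concentration theorem, for all $M > 0$ and
all $t > 0$,
\[
P\left( Z(M) \ge \bar a_n M\left( 1 + t\sqrt{2(1 + 2\bar a_n K_m)} + \frac{2t^2\bar a_n K_m}{3} \right) \right) \le \exp[-n\bar a_n^2 t^2].
\]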
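To get a feel for the magnitudes in Corollary A.1, the quantities can be evaluated directly. The sketch below is only illustrative: it assumes $a_n = \sqrt{2\log(2m)/n} + \log(2m)K_m/n$, $\bar a_n = 4a_n$, and the threshold $\bar a_n(1 + t\sqrt{2(1 + 2\bar a_n K_m)} + 2t^2\bar a_n K_m/3)$ per unit of $M$, with tail level $\exp[-n\bar a_n^2t^2]$. The function names and the sample values of $n$, $m$, $K_m$ and $t$ are hypothetical.

```python
import math


def a_n(n, m, K_m):
    """a_n = sqrt(2 log(2m)/n) + log(2m) K_m / n."""
    return math.sqrt(2 * math.log(2 * m) / n) + math.log(2 * m) / n * K_m


def lambda_bar_n0(n, m, K_m, t):
    """Threshold per unit of M in Corollary A.1, built from a_bar_n = 4 a_n."""
    a_bar = 4 * a_n(n, m, K_m)
    return a_bar * (1 + t * math.sqrt(2 * (1 + 2 * a_bar * K_m))
                    + 2 * t ** 2 * a_bar * K_m / 3)


def tail_level(n, m, K_m, t):
    """Probability bound exp[-n a_bar_n^2 t^2] attached to that threshold."""
    a_bar = 4 * a_n(n, m, K_m)
    return math.exp(-n * a_bar ** 2 * t ** 2)


for t in (0.0, 0.5, 1.0):
    print(t, lambda_bar_n0(500, 100, 1.5, t), tail_level(500, 100, 1.5, t))
```

At $t = 0$ the threshold reduces to $\bar a_n$, the bound on $EZ(M)/M$; larger $t$ buys a smaller tail level at the price of a larger threshold.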

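A.3. Proofs of the results in Section 2.1. In this subsection, we assume
that the $\sigma_k$ are known and we consider the estimator
\[
\hat\theta_n := \arg\min_{\theta\in\Theta}\left\{ \frac{1}{n}\sum_{i=1}^n \gamma_{f_\theta}(X_i,Y_i) + \lambda_n I(\theta) \right\}.
\]
Take $b > 0$, $d > 1$, and
\[
d_b := d\left( \frac{b + d}{(d-1)b} \vee 1 \right).
\]
Let

(A1) $\lambda_n := (1+b)\bar\lambda_{n,0}$,

(A2) $V_\theta := 2\delta H\bigl(2\lambda_n\sqrt{D_\theta}/\delta\bigr)$, where $0 < \delta < 1$,

(A3) $\theta_n^* := \arg\min_{\theta\in\Theta}\{\mathcal{E}(f_\theta) + V_\theta\}$,

(A4) $\epsilon_n^* := (1+\delta)\mathcal{E}(f_{\theta_n^*}) + V_{\theta_n^*}$,

(A5) $\zeta_n^* := \epsilon_n^*/\bar\lambda_{n,0}$,

(A6) $\theta(\epsilon_n^*) := \arg\min_{\theta\in\Theta,\ I(\theta-\theta_n^*)\le d_b\zeta_n^*/b}\{\delta\mathcal{E}(f_\theta) - 2\lambda_n I_1(\theta - \theta_n^*|\theta_n^*)\}$.

Condition I(b,δ). It holds that $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta$.

Condition II(b,δ,d). It holds that $\|f_{\theta(\epsilon_n^*)} - \bar f\|_\infty \le \eta$.

In this subsection, we let $\lambda_n$, $V_\theta$, $\theta_n^*$, $\epsilon_n^*$, $\zeta_n^*$ and $\theta(\epsilon_n^*)$ be defined in (A1)-(A6).
Theorem 2.1 takes $b = 1$, $\delta = 1/2$ and $d = 2$.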

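We start out with proving a bound for the $\ell_1$ norm of the coefficients $\theta$,
when restricted to the set where $\theta_{n,k}^* \ne 0$, in terms of the excess risk $\mathcal{E}(f_\theta)$.

Lemma A.4. Suppose Condition I(b,δ) and Condition II(b,δ,d) are met.
For all $\theta \in \Theta$ with $I(\theta - \theta_n^*) \le d_b\zeta_n^*/b$, it holds that
\[
2\lambda_n I_1(\theta - \theta_n^*|\theta_n^*) \le \delta\mathcal{E}(f_\theta) + \epsilon_n^* - \mathcal{E}(f_{\theta_n^*}).
\]

Proof. We use the short-hand notation $I_1(\theta) = I_1(\theta|\theta_n^*)$, $\theta \in \Theta$. When
$I(\theta - \theta_n^*) \le d_b\zeta_n^*/b$, one has
\[
2\lambda_n I_1(\theta - \theta_n^*) = 2\lambda_n I_1(\theta - \theta_n^*) - \delta\mathcal{E}(f_\theta) + \delta\mathcal{E}(f_\theta)
\le 2\lambda_n I_1(\theta(\epsilon_n^*) - \theta_n^*) - \delta\mathcal{E}(f_{\theta(\epsilon_n^*)}) + \delta\mathcal{E}(f_\theta).
\]
By Assumption C, combined with Condition II(b,δ,d),
\[
2\lambda_n I_1(\theta(\epsilon_n^*) - \theta_n^*) \le 2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_{\theta(\epsilon_n^*)} - f_{\theta_n^*}\|.
\]
By the triangle inequality,
\[
2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_{\theta(\epsilon_n^*)} - f_{\theta_n^*}\|
\le 2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_{\theta(\epsilon_n^*)} - \bar f\| + 2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_{\theta_n^*} - \bar f\|.
\]
Since $\|f_{\theta(\epsilon_n^*)} - \bar f\|_\infty \le \eta$ as well as $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta$, it follows from Condition
I(b,δ) and Condition II(b,δ,d), combined with Assumption B, that
\[
2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_{\theta(\epsilon_n^*)} - f_{\theta_n^*}\| \le \delta\mathcal{E}(f_{\theta(\epsilon_n^*)}) + \delta\mathcal{E}(f_{\theta_n^*}) + V_{\theta_n^*}.
\]
Hence, when $I(\theta - \theta_n^*) \le d_b\zeta_n^*/b$,
\[
2\lambda_n I_1(\theta - \theta_n^*) \le \delta\mathcal{E}(f_\theta) + \delta\mathcal{E}(f_{\theta_n^*}) + V_{\theta_n^*} = \delta\mathcal{E}(f_\theta) + \epsilon_n^* - \mathcal{E}(f_{\theta_n^*}). □
\]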
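The step in the proof of Lemma A.4 that invokes Assumption B is a conjugate-function inequality. The following worked derivation is a sketch under assumptions taken from the main text rather than from this appendix: Assumption B is read as supplying a convex function $G$ with $\mathcal{E}(f_\theta) \ge G(\|f_\theta - \bar f\|)$ whenever $\|f_\theta - \bar f\|_\infty \le \eta$, and $H$ is read as the convex conjugate of $G$, so that $uv \le G(u) + H(v)$.

```latex
\begin{align*}
2\lambda_n\sqrt{D_{\theta_n^*}}\,\|f_\theta - \bar f\|
  &= \delta\cdot\|f_\theta - \bar f\|\cdot\frac{2\lambda_n\sqrt{D_{\theta_n^*}}}{\delta} \\
  &\le \delta\,G\bigl(\|f_\theta - \bar f\|\bigr)
     + \delta\,H\!\left(\frac{2\lambda_n\sqrt{D_{\theta_n^*}}}{\delta}\right) \\
  &\le \delta\,\mathcal{E}(f_\theta) + \tfrac{1}{2}V_{\theta_n^*}.
\end{align*}
```

Applying this once with $\theta = \theta(\epsilon_n^*)$ and once with $\theta = \theta_n^*$, and adding the two bounds, gives $\delta\mathcal{E}(f_{\theta(\epsilon_n^*)}) + \delta\mathcal{E}(f_{\theta_n^*}) + V_{\theta_n^*}$.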

We now show that for any $\tilde\theta$ for which the penalized empirical risk is not
larger than that of the oracle $\theta_n^*$, the $\ell_1$ difference between $\tilde\theta$ and $\theta_n^*$ has to
be "small." Lemma A.5 represents one iteration, which shows that on the
set $I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b$, in fact, except on a subset with small probability,
$I(\tilde\theta - \theta_n^*)$ is strictly smaller than $d_0\zeta_n^*/b$.

Let for $M > 0$, $Z(M)$ be the random variable defined in (14), with $\theta^* = \theta_n^*$.

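Lemma A.5. Suppose Condition I(b,δ) and Condition II(b,δ,d) are met.
Consider any (random) $\tilde\theta \in \Theta$ with $R_n(f_{\tilde\theta}) + \lambda_n I(\tilde\theta) \le R_n(f_{\theta_n^*}) + \lambda_n I(\theta_n^*)$.
Let $1 < d_0 \le d_b$. Then
\[
P\left( I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b} \right)
\le P\left( I(\tilde\theta - \theta_n^*) \le \left(\frac{d_0 + b}{1 + b}\right)\frac{\zeta_n^*}{b} \right) + \exp[-n\bar a_n^2t^2].
\]

Proof. We let $\tilde{\mathcal{E}} := \mathcal{E}(f_{\tilde\theta})$ and $\mathcal{E}^* := \mathcal{E}(f_{\theta_n^*})$. We also use the short
hand notation $I_1(\theta) = I_1(\theta|\theta_n^*)$ and $I_2(\theta) = I_2(\theta|\theta_n^*)$. Since $R_n(f_{\tilde\theta}) + \lambda_n I(\tilde\theta) \le
R_n(f_{\theta_n^*}) + \lambda_n I(\theta_n^*)$, we know that when $I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b$,
\[
\tilde{\mathcal{E}} + \lambda_n I(\tilde\theta) \le Z(d_0\zeta_n^*/b) + \mathcal{E}^* + \lambda_n I(\theta_n^*).
\]
With probability at least $1 - \exp[-n\bar a_n^2t^2]$, the random variable $Z(d_0\zeta_n^*/b)$
is bounded by $\bar\lambda_{n,0}d_0\zeta_n^*/b$. But then we have
\[
\tilde{\mathcal{E}} + \lambda_n I(\tilde\theta) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + \lambda_n I(\theta_n^*).
\]
Invoking $\lambda_n = (1+b)\bar\lambda_{n,0}$, $I(\tilde\theta) = I_1(\tilde\theta) + I_2(\tilde\theta)$ and $I(\theta_n^*) = I_1(\theta_n^*)$, we find
on $\{I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b\} \cap \{Z(d_0\zeta_n^*/b) \le \bar\lambda_{n,0}d_0\zeta_n^*/b\}$, that
\[
\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I_2(\tilde\theta)
\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\theta_n^*) - (1+b)\bar\lambda_{n,0}I_1(\tilde\theta)
\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*).
\]
But $I_2(\tilde\theta) = I_2(\tilde\theta - \theta_n^*)$. So if we add another $(1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*)$ to both
left- and right-hand side of the last inequality, we obtain
\[
\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\tilde\theta - \theta_n^*) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + 2(1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*).
\]
Since $d_0 \le d_b$, we know from Lemma A.4 that this implies
\[
\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\tilde\theta - \theta_n^*) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}} + \epsilon_n^*
= (d_0 + b)\bar\lambda_{n,0}\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}},
\]
as $\epsilon_n^* = \bar\lambda_{n,0}\zeta_n^*$. But then
\[
(1-\delta)\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\tilde\theta - \theta_n^*) \le (d_0 + b)\bar\lambda_{n,0}\frac{\zeta_n^*}{b}.
\]
Because $0 < \delta < 1$, the result follows. □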

One may repeat the argument of the previous lemma N times, to get the 
next corollary. 

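Corollary A.2. Suppose Condition I(b,δ) and Condition II(b,δ,d) are
met. Let $d_0 \le d_b$. For any (random) $\tilde\theta \in \Theta$ with $R_n(f_{\tilde\theta}) + \lambda_n I(\tilde\theta) \le R_n(f_{\theta_n^*}) +
\lambda_n I(\theta_n^*)$,
\[
P\left( I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b} \right)
\le P\left( I(\tilde\theta - \theta_n^*) \le \bigl(1 + (d_0 - 1)(1+b)^{-N}\bigr)\frac{\zeta_n^*}{b} \right) + N\exp[-n\bar a_n^2t^2].
\]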
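The constant in Corollary A.2 comes from repeating the one-step map $d_0 \mapsto (d_0 + b)/(1 + b)$ of Lemma A.5. The short sketch below (function names are hypothetical) iterates that map and compares it with the closed form $1 + (d_0 - 1)(1+b)^{-N}$, which shows the radius shrinking geometrically towards 1.

```python
def iterate_radius(d0, b, N):
    """Apply the one-step contraction d -> (d + b) / (1 + b) a total of N times."""
    d = d0
    for _ in range(N):
        d = (d + b) / (1 + b)
    return d


def closed_form(d0, b, N):
    """Closed form 1 + (d0 - 1)(1 + b)^(-N) appearing in Corollary A.2."""
    return 1 + (d0 - 1) * (1 + b) ** (-N)


for N in range(6):
    print(N, iterate_radius(6.0, 1.0, N), closed_form(6.0, 1.0, N))
```

Each step costs one additional term $\exp[-n\bar a_n^2t^2]$ in the probability bound, which is why the factor $N$ appears in the corollary.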

The next lemma considers a convex combination of $\hat\theta_n$ and $\theta_n^*$ such that
the $\ell_1$ distance between this convex combination and $\theta_n^*$ is small enough.



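Lemma A.6. Suppose Condition I(b,δ) and Condition II(b,δ,d) are met.
Define
\[
\tilde\theta_s = s\hat\theta_n + (1-s)\theta_n^*,
\]
with
\[
s = \frac{d\zeta_n^*}{d\zeta_n^* + bI(\hat\theta_n - \theta_n^*)}.
\]
Then, for any integer $N$, with probability at least $1 - N\exp[-n\bar a_n^2t^2]$ we have
\[
I(\tilde\theta_s - \theta_n^*) \le \bigl(1 + (d-1)(1+b)^{-N}\bigr)\frac{\zeta_n^*}{b}.
\]

Proof. The loss function and penalty are convex, so
\[
R_n(f_{\tilde\theta_s}) + \lambda_n I(\tilde\theta_s) \le sR_n(f_{\hat\theta_n}) + s\lambda_n I(\hat\theta_n) + (1-s)R_n(f_{\theta_n^*}) + (1-s)\lambda_n I(\theta_n^*)
\le R_n(f_{\theta_n^*}) + \lambda_n I(\theta_n^*).
\]
Moreover,
\[
I(\tilde\theta_s - \theta_n^*) = sI(\hat\theta_n - \theta_n^*) = \frac{d\zeta_n^* I(\hat\theta_n - \theta_n^*)}{d\zeta_n^* + bI(\hat\theta_n - \theta_n^*)} \le d\frac{\zeta_n^*}{b}.
\]
By the definition of $d_b$, we know $d \le d_b$. Now apply the previous corollary with
$\tilde\theta = \tilde\theta_s$ and $d_0 = d$. □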

One may now deduce that not only the convex combination, but also $\hat\theta_n$
itself is close to $\theta_n^*$, in $\ell_1$ distance.

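Lemma A.7. Suppose Condition I(b,δ) and Condition II(b,δ,d) are met.
Let $N_1 \in \mathbb{N}$ and $N_2 \in \mathbb{N} \cup \{0\}$. Define $\delta_1 = (1+b)^{-N_1}$ ($N_1 \ge 1$), and $\delta_2 =
(1+b)^{-N_2}$. With probability at least $1 - (N_1 + N_2)\exp[-n\bar a_n^2t^2]$, we have
\[
I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\frac{\zeta_n^*}{b},
\]
with
\[
d(\delta_1,\delta_2) = 1 + \left( \frac{1 + (d^2-1)\delta_1}{(d-1)(1-\delta_1)} \right)\delta_2. \tag{16}
\]

Proof. We know from Lemma A.6 that with probability at least
$1 - N_1\exp[-n\bar a_n^2t^2]$,
\[
I(\tilde\theta_s - \theta_n^*) \le \bigl(1 + (d-1)\delta_1\bigr)\frac{\zeta_n^*}{b}.
\]
But then
\[
I(\hat\theta_n - \theta_n^*) \le \bar d_0\frac{\zeta_n^*}{b}, \qquad \bar d_0 := 1 + \frac{1 + (d^2-1)\delta_1}{(d-1)(1-\delta_1)}.
\]
We have $\bar d_0 \le d_b$, since $0 < \delta_1 \le 1/(1+b)$, and by the definition of $d_b$. Therefore,
we may apply Corollary A.2 to $\hat\theta_n$, with $d_0$ replaced by $\bar d_0$. We find that with
probability at least $1 - (N_1 + N_2)\exp[-n\bar a_n^2t^2]$,
\[
I(\hat\theta_n - \theta_n^*) \le \bigl(1 + (\bar d_0 - 1)\delta_2\bigr)\frac{\zeta_n^*}{b}
= \left( 1 + \frac{1 + (d^2-1)\delta_1}{(d-1)(1-\delta_1)}\delta_2 \right)\frac{\zeta_n^*}{b}. □
\]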



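Recall the notation
\[
d_b := d\left( \frac{b + d}{(d-1)b} \vee 1 \right).
\]
Write
\[
\Delta(b,\delta,\delta_1,\delta_2) := d(\delta_1,\delta_2)\frac{1-\delta^2}{\delta b} \vee 1. \tag{17}
\]

Theorem A.4. Suppose Condition I(b,δ) and Condition II(b,δ,d) are
met. Let $\delta_1$ and $\delta_2$ be as in Lemma A.7. We have with probability at least
\[
1 - \left( \log_{1+b}\frac{(1+b)^2\Delta(b,\delta,\delta_1,\delta_2)}{\delta_1\delta_2} \right)\exp[-n\bar a_n^2t^2],
\]
that
\[
\mathcal{E}(\hat f_n) \le \frac{\epsilon_n^*}{1-\delta},
\]
and moreover
\[
I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\frac{\zeta_n^*}{b}.
\]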

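Proof. Define $\hat{\mathcal{E}} := \mathcal{E}(f_{\hat\theta_n})$ and $\mathcal{E}^* := \mathcal{E}(f_{\theta_n^*})$. We also again use the short
hand notation $I_1(\theta) = I_1(\theta|\theta_n^*)$ and $I_2(\theta) = I_2(\theta|\theta_n^*)$. Set
\[
c := \frac{\delta b}{1-\delta^2}.
\]
We consider the cases (a) $c < d(\delta_1,\delta_2)$ and (b) $c \ge d(\delta_1,\delta_2)$.

(a) Suppose first that $c < d(\delta_1,\delta_2)$. Let $J$ be an integer satisfying
$(1+b)^{J-1}c \le d(\delta_1,\delta_2)$ and $(1+b)^Jc > d(\delta_1,\delta_2)$. We consider the cases (a1) $c\zeta_n^*/b <
I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\zeta_n^*/b$ and (a2) $I(\hat\theta_n - \theta_n^*) \le c\zeta_n^*/b$.

(a1) If $c\zeta_n^*/b < I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\zeta_n^*/b$, then
\[
(1+b)^{j-1}c\zeta_n^*/b < I(\hat\theta_n - \theta_n^*) \le (1+b)^jc\zeta_n^*/b,
\]
for some $j \in \{1,\dots,J\}$. Except on a set with probability at most $\exp[-n\bar a_n^2t^2]$,
we thus have that
\[
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) \le (1+b)\bar\lambda_{n,0}I(\hat\theta_n - \theta_n^*) + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*).
\]
So then, by similar arguments as in the proof of Lemma A.5, we find
\[
\hat{\mathcal{E}} \le 2(1+b)\bar\lambda_{n,0}I_1(\hat\theta_n - \theta_n^*) + \mathcal{E}^*.
\]
Since $d(\delta_1,\delta_2) \le d_b$, we obtain $\hat{\mathcal{E}} \le \epsilon_n^* + \delta\hat{\mathcal{E}}$, so then $\hat{\mathcal{E}} \le \epsilon_n^*/(1-\delta)$.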

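(a2) If $I(\hat\theta_n - \theta_n^*) \le c\zeta_n^*/b$, we find, except on a set with probability at
most $\exp[-n\bar a_n^2t^2]$, that
\[
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) \le \left(\frac{\delta}{1-\delta^2}\right)\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*), \tag{18}
\]
which gives
\begin{align*}
\hat{\mathcal{E}} &\le \left(\frac{\delta}{1-\delta^2}\right)\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\hat\theta_n - \theta_n^*) \\
&\le \left(\frac{\delta}{1-\delta^2}\right)\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + \frac{\delta}{2}\hat{\mathcal{E}} + \frac{\epsilon_n^*}{2} - \frac{\mathcal{E}^*}{2} \\
&= \left(\frac{\delta}{1-\delta^2}\right)\epsilon_n^* + \frac{\delta}{2}\hat{\mathcal{E}} + \frac{\epsilon_n^*}{2} + \frac{\mathcal{E}^*}{2} \\
&\le \left( \frac{\delta}{1-\delta^2} + \frac{1}{2} + \frac{1}{2(1+\delta)} \right)\epsilon_n^* + \frac{\delta}{2}\hat{\mathcal{E}}.
\end{align*}
But this yields
\[
\hat{\mathcal{E}} \le \frac{2}{2-\delta}\left( \frac{\delta}{1-\delta^2} + \frac{1}{2} + \frac{1}{2(1+\delta)} \right)\epsilon_n^* = \frac{\epsilon_n^*}{1-\delta}.
\]
Furthermore, by Lemma A.7, we have with probability at least $1 - (N_1 +
N_2)\exp[-n\bar a_n^2t^2]$, that
\[
I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\frac{\zeta_n^*}{b}.
\]
The result now follows from
\[
J + 1 \le \log_{1+b}\left( \frac{(1+b)^2d(\delta_1,\delta_2)}{c} \right)
\]
and
\[
N_1 = \log_{1+b}\left(\frac{1}{\delta_1}\right), \qquad N_2 = \log_{1+b}\left(\frac{1}{\delta_2}\right).
\]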



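(b) Finally, consider the case $c \ge d(\delta_1,\delta_2)$. Then on the set where
$I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\zeta_n^*/b$, we again have that, except on a subset with
probability at most $\exp[-n\bar a_n^2t^2]$,
\begin{align*}
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) &\le d(\delta_1,\delta_2)\bar\lambda_{n,0}\frac{\zeta_n^*}{b} + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*) \\
&\le \left(\frac{\delta}{1-\delta^2}\right)\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*),
\end{align*}
as
\[
d(\delta_1,\delta_2) \le c = \frac{\delta b}{1-\delta^2}.
\]
So we arrive at the same inequality as (18) and we may proceed as there.
Note finally that also in this case
\[
(N_1 + N_2 + 1) \le (N_1 + N_2 + 2) = \log_{1+b}\left( \frac{(1+b)^2}{\delta_1\delta_2} \right) \le \log_{1+b}\left( \frac{(1+b)^2\Delta(b,\delta,\delta_1,\delta_2)}{\delta_1\delta_2} \right). □
\]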

Proof of Theorem 2.1. Theorem 2.1 is a special case of Theorem
A.4, with $b = 1$, $\delta = 1/2$, $d = 2$ and $\delta_1 = \delta_2 = 1/2$. □
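The constants that this special case produces can be evaluated mechanically. The sketch below assumes the reconstructed forms $d_b = d\bigl(\tfrac{b+d}{(d-1)b} \vee 1\bigr)$, $d(\delta_1,\delta_2)$ as in (16) and $\Delta$ as in (17); the function names are hypothetical and the snippet only illustrates the arithmetic, it does not restate Theorem 2.1.

```python
import math


def d_b(b, d):
    """d_b = d * max((b + d) / ((d - 1) b), 1)."""
    return d * max((b + d) / ((d - 1) * b), 1.0)


def d_delta(d, delta1, delta2):
    """d(delta1, delta2) from (16)."""
    return 1 + (1 + (d ** 2 - 1) * delta1) / ((d - 1) * (1 - delta1)) * delta2


def big_delta(b, delta, d, delta1, delta2):
    """Delta(b, delta, delta1, delta2) from (17)."""
    return max(d_delta(d, delta1, delta2) * (1 - delta ** 2) / (delta * b), 1.0)


def exp_multiplier(b, delta, d, delta1, delta2):
    """Factor log_{1+b}((1+b)^2 Delta / (delta1 delta2)) in front of the exponential."""
    ratio = (1 + b) ** 2 * big_delta(b, delta, d, delta1, delta2) / (delta1 * delta2)
    return math.log(ratio, 1 + b)


b, delta, d, delta1, delta2 = 1.0, 0.5, 2.0, 0.5, 0.5
print(d_b(b, d))                                  # 6.0
print(d_delta(d, delta1, delta2))                 # 3.5
print(big_delta(b, delta, d, delta1, delta2))     # 5.25
print(exp_multiplier(b, delta, d, delta1, delta2))
```

With these inputs the multiplier equals $\log_2 84 \approx 6.39$ and the $\ell_1$ radius constant is $3.5$, so any simpler integer constants quoted for the special case can be read as conservative roundings of these values.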

A.4. Proof of the results in Section 2.2. Let
\[
\Omega = \{\sigma_k/c_1 \le \hat\sigma_k \le c_2\sigma_k \ \ \forall\, k\},
\]
where $c_1 > 1$ and $c_2 = \sqrt{2c_1^2 - 1}/c_1$. We show that for some $s$ depending on
$c_1$, the set $\Omega$ has probability at least $1 - \exp[-na_n^2s^2]$.



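Lemma A.8. We have
\[
E\left( \max_{1\le k\le m} \left| \frac{\hat\sigma_k^2}{\sigma_k^2} - 1 \right| \right) \le a_nK_m.
\]

Proof. Apply Lemma A.1 with $\eta_n = K_m^2$ and $\tau_n^2 = K_m^2$. □

Recall now that for $s > 0$,
\[
\lambda_{n,0}(s) := a_n\left( 1 + s\sqrt{2(1 + 2a_nK_m)} + \frac{2s^2a_nK_m}{3} \right).
\]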

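Lemma A.9. Let $c_0 > 1$. Suppose that
\[
2\sqrt{\frac{\log(2m)}{n}}\,K_m \le \frac{\sqrt{6c_0^2 - 4} - \sqrt{2}\,c_0}{c_0}.
\]
Let $c_1 > c_0$. Then there exists a solution $s > 0$ of
\[
1 - \frac{1}{c_1^2} = K_m\lambda_{n,0}(s). \tag{19}
\]
Moreover, with this value of $s$,
\[
P(\Omega) \ge 1 - \exp[-na_n^2s^2].
\]

Proof. We first note that
\[
2\sqrt{\frac{\log(2m)}{n}}\,K_m \le \frac{\sqrt{6c_0^2 - 4} - \sqrt{2}\,c_0}{c_0}
\]
implies
\[
a_nK_m = \sqrt{\frac{2\log(2m)}{n}}\,K_m + \frac{\log(2m)}{n}K_m^2 \le 1 - \frac{1}{c_0^2}.
\]
Therefore, there exists a solution $s > 0$ satisfying (19). By Bousquet's
inequality and Lemma A.8,
\[
P\left( \max_k \left| \frac{\hat\sigma_k^2}{\sigma_k^2} - 1 \right| \ge K_m\lambda_{n,0}(s) \right) \le \exp[-na_n^2s^2].
\]
In other words,
\[
P\left( \max_k \left| \frac{\hat\sigma_k^2}{\sigma_k^2} - 1 \right| \le 1 - \frac{1}{c_1^2} \right) \ge 1 - \exp[-na_n^2s^2].
\]
But the inequality
\[
\left| \frac{\hat\sigma_k^2}{\sigma_k^2} - 1 \right| \le 1 - \frac{1}{c_1^2} \quad \forall\, k
\]
is equivalent to
\[
\sigma_k/c_1 \le \hat\sigma_k \le c_2\sigma_k \quad \forall\, k.
\]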



□ 
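Equation (19) is a quadratic in $s$ and can be solved in closed form. The sketch below is an illustration under stated assumptions: it takes $\lambda_{n,0}(s) = a_n(1 + s\sqrt{2(1 + 2a_nK_m)} + 2s^2a_nK_m/3)$ with $a_n = \sqrt{2\log(2m)/n} + \log(2m)K_m/n$, and the sample values of $n$, $m$, $K_m$ are hypothetical, as are the function names.

```python
import math


def a_n(n, m, K_m):
    """a_n = sqrt(2 log(2m)/n) + log(2m) K_m / n."""
    return math.sqrt(2 * math.log(2 * m) / n) + math.log(2 * m) / n * K_m


def lambda_n0(s, a, K_m):
    """lambda_{n,0}(s) = a (1 + s sqrt(2(1 + 2 a K_m)) + 2 s^2 a K_m / 3)."""
    return a * (1 + s * math.sqrt(2 * (1 + 2 * a * K_m)) + 2 * s ** 2 * a * K_m / 3)


def c2_from_c1(c1):
    """c_2 = sqrt(2 c_1^2 - 1) / c_1, the upper constant in the set Omega."""
    return math.sqrt(2 * c1 ** 2 - 1) / c1


def solve_s(n, m, K_m, c1):
    """Positive root s of 1 - 1/c1^2 = K_m * lambda_{n,0}(s)."""
    a = a_n(n, m, K_m)
    target = 1 - 1 / c1 ** 2
    if K_m * a >= target:
        raise ValueError("no positive solution: K_m * a_n is too large")
    A = math.sqrt(2 * (1 + 2 * a * K_m))   # coefficient of s
    B = 2 * a * K_m / 3                    # coefficient of s^2
    C = 1 - target / (K_m * a)             # constant term, negative here
    return (-A + math.sqrt(A * A - 4 * B * C)) / (2 * B)


n, m, K_m, c1 = 1000, 50, 2.0, 1.5
s = solve_s(n, m, K_m, c1)
print("c2 =", c2_from_c1(c1))
print("s =", s, "level =", math.exp(-n * a_n(n, m, K_m) ** 2 * s ** 2))
```

The guard `K_m * a >= target` mirrors the role of the condition on $\sqrt{\log(2m)/n}\,K_m$ in Lemma A.9: a positive root exists only when $K_ma_n$ lies strictly below $1 - 1/c_1^2$.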



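Recall the estimator
\[
\hat\theta_n = \arg\min_{\theta\in\Theta}\{R_n(f_\theta) + \lambda_n\hat I(\theta)\}.
\]
In this case, for $1 + b > c_1 > 1$, $c_2 = \sqrt{2c_1^2 - 1}/c_1$, and $d > 1$, and for
\[
d_b := d\left( \frac{b + d}{(d-1)b} \vee 1 \right)
\]
given as before, we define:

(A1)' $\lambda_n := c_1(1+b)\bar\lambda_{n,0}$,

(A2)' $V_\theta := 2\delta H\bigl(2c_2\lambda_n\sqrt{D_\theta}/\delta\bigr)$, where $0 < \delta < 1$,

(A3)' $\theta_n^* := \arg\min_{\theta\in\Theta}\{\mathcal{E}(f_\theta) + V_\theta\}$,

(A4)' $\epsilon_n^* := (1+\delta)\mathcal{E}(f_{\theta_n^*}) + V_{\theta_n^*}$,

(A5)' $\zeta_n^* := \epsilon_n^*/\bar\lambda_{n,0}$,

(A6)' $\theta(\epsilon_n^*) := \arg\min_{\theta\in\Theta,\ I(\theta-\theta_n^*)\le d_b\zeta_n^*/b}\{\delta\mathcal{E}(f_\theta) - 2c_2\lambda_n I_1(\theta - \theta_n^*|\theta_n^*)\}$.

Note that on $\Omega$,
\[
I(\theta)/c_1 \le \hat I(\theta) \le c_2I(\theta) \quad \forall\, \theta
\]
and
\[
I_k(\theta|\theta_n^*)/c_1 \le \hat I_k(\theta|\theta_n^*) \le c_2I_k(\theta|\theta_n^*), \quad k = 1,2, \ \forall\, \theta.
\]
Hence, we have upper and lower bounds for the estimated $\ell_1$ norms. This
means we may proceed as in the previous subsection.

Condition I(b,δ,c_1,c_2). It holds that
\[
\|f_{\theta_n^*} - \bar f\|_\infty \le \eta.
\]
Condition II(b,δ,d,c_1,c_2). It holds that $\|f_{\theta(\epsilon_n^*)} - \bar f\|_\infty \le \eta$.

Condition III(c_0). For some known constant $c_0 > 1$, it holds that
\[
2\sqrt{\frac{\log(2m)}{n}}\,K_m \le \frac{\sqrt{6c_0^2 - 4} - \sqrt{2}\,c_0}{c_0}.
\]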

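Lemma A.10. Suppose Condition I(b,δ,c_1,c_2) and Condition II(b,δ,d,c_1,c_2).
Let $c_1b/(1 + b - c_1) < d_0 \le d_b$. For any (random) $\tilde\theta \in \Theta$ with $R_n(f_{\tilde\theta}) +
\lambda_n\hat I(\tilde\theta) \le R_n(f_{\theta_n^*}) + \lambda_n\hat I(\theta_n^*)$, we have
\[
P\left( I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b} \right)
\le P\left( I(\tilde\theta - \theta_n^*) \le c_1\left(\frac{d_0 + b}{1 + b}\right)\frac{\zeta_n^*}{b} \right)
+ 1 - P(\Omega) + \exp[-n\bar a_n^2t^2].
\]

Proof. This is essentially repeating the argument of Lemma A.5. Let
$\tilde{\mathcal{E}} := \mathcal{E}(f_{\tilde\theta})$ and $\mathcal{E}^* := \mathcal{E}(f_{\theta_n^*})$, and use the short hand notation $I_1(\theta) = I_1(\theta|\theta_n^*)$
and $I_2(\theta) = I_2(\theta|\theta_n^*)$. Likewise, $\hat I_1(\theta) = \hat I_1(\theta|\theta_n^*)$ and $\hat I_2(\theta) = \hat I_2(\theta|\theta_n^*)$.

On the set $\{I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b\}$, we have, except on a subset with
probability at most $\exp[-n\bar a_n^2t^2]$, that
\[
\tilde{\mathcal{E}} + \lambda_n\hat I(\tilde\theta) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + \lambda_n\hat I(\theta_n^*).
\]
In the remainder of the proof, we will, therefore, only need to consider the
set
\[
\Omega \cap \{I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b\} \cap \left\{ \tilde{\mathcal{E}} + \lambda_n\hat I(\tilde\theta) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + \lambda_n\hat I(\theta_n^*) \right\}.
\]
Invoking $\lambda_n = c_1(1+b)\bar\lambda_{n,0}$, $\hat I(\tilde\theta) = \hat I_1(\tilde\theta) + \hat I_2(\tilde\theta)$ and $\hat I(\theta_n^*) = \hat I_1(\theta_n^*)$, we find
\begin{align*}
\tilde{\mathcal{E}} + c_1(1+b)\bar\lambda_{n,0}\hat I_2(\tilde\theta)
&\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + c_1(1+b)\bar\lambda_{n,0}\hat I_1(\theta_n^*) - c_1(1+b)\bar\lambda_{n,0}\hat I_1(\tilde\theta) \\
&\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + c_1(1+b)\bar\lambda_{n,0}\hat I_1(\tilde\theta - \theta_n^*).
\end{align*}
We add another $c_1(1+b)\bar\lambda_{n,0}\hat I_1(\tilde\theta - \theta_n^*)$ to both left- and right-hand side of
the last inequality, and then apply $\hat I_1(\tilde\theta - \theta_n^*) \le c_2I_1(\tilde\theta - \theta_n^*)$, to obtain
\begin{align*}
\tilde{\mathcal{E}} + c_1(1+b)\bar\lambda_{n,0}\hat I(\tilde\theta - \theta_n^*)
&\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + 2c_1(1+b)\bar\lambda_{n,0}\hat I_1(\tilde\theta - \theta_n^*) \\
&\le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + 2c_1(1+b)c_2\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*).
\end{align*}
Now $I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b \le d_b\zeta_n^*/b$. We can easily modify Lemma A.4 to the
new situation, to see that
\[
2c_1(1+b)c_2\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*) \le V_{\theta_n^*} + \delta\tilde{\mathcal{E}} + \delta\mathcal{E}^* = \delta\tilde{\mathcal{E}} + \epsilon_n^* - \mathcal{E}^*.
\]
So
\[
\tilde{\mathcal{E}} + c_1(1+b)\bar\lambda_{n,0}\hat I(\tilde\theta - \theta_n^*) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}} + \epsilon_n^*
= (d_0 + b)\bar\lambda_{n,0}\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}}.
\]
But then, using $\hat I(\tilde\theta - \theta_n^*) \ge I(\tilde\theta - \theta_n^*)/c_1$, and $0 < \delta < 1$, we derive
\[
(1+b)\bar\lambda_{n,0}I(\tilde\theta - \theta_n^*) \le c_1(d_0 + b)\bar\lambda_{n,0}\frac{\zeta_n^*}{b}. □
\]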

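Recall the definitions
\[
\delta_1 := (1/(1+b))^{N_1}, \qquad \delta_2 := (1/(1+b))^{N_2}
\]
and, as in (16) and (17),
\[
d(\delta_1,\delta_2) := 1 + \left( \frac{1 + (d^2-1)\delta_1}{(d-1)(1-\delta_1)} \right)\delta_2
\]
and
\[
\Delta(b,\delta,\delta_1,\delta_2) := d(\delta_1,\delta_2)\frac{1-\delta^2}{\delta b} \vee 1.
\]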

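Theorem A.5. Suppose Conditions I(b,δ,c_1,c_2), II(b,δ,d,c_1,c_2) and
III(c_0) are met, with $c_2 = \sqrt{2c_1^2 - 1}/c_1$ and $c_1 > c_0 > 1$. Take
\[
\bar\lambda_{n,0} \ge 4\sqrt{\frac{\log(2m)}{n}}\left( \sqrt{2} + \frac{\sqrt{6c_0^2 - 4} - \sqrt{2}\,c_0}{2c_0} \right).
\]
Then there exists a solution $s > 0$ of $1 - 1/c_1^2 = K_m\lambda_{n,0}(s)$, and a solution
$t > 0$ of $\bar\lambda_{n,0} = \bar\lambda_{n,0}(t)$. We have with probability at least $1 - \alpha$, with
\[
\alpha = \exp[-na_n^2s^2] + \left( \log_{1+b}\frac{(1+b)^2\Delta(b,\delta,\delta_1,\delta_2)}{\delta_1\delta_2} \right)\exp[-n\bar a_n^2t^2],
\]
that
\[
\mathcal{E}(\hat f_n) \le \frac{\epsilon_n^*}{1-\delta},
\]
and moreover
\[
I(\hat\theta_n - \theta_n^*) \le d(\delta_1,\delta_2)\frac{\zeta_n^*}{b}.
\]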



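Proof. Since $c_1 > c_0$, there exists a solution $s > 0$ of $1 - 1/c_1^2 = K_m\lambda_{n,0}(s)$.
By Lemma A.9, the set $\Omega$ has probability at least $1 - \exp[-na_n^2s^2]$. We may
therefore restrict attention to the set $\Omega$.

We also know that
\[
\bar a_n = 4\sqrt{\frac{\log(2m)}{n}}\left( \sqrt{2} + \sqrt{\frac{\log(2m)}{n}}\,K_m \right) \le \bar\lambda_{n,0}.
\]
Hence there is a solution $t > 0$ of $\bar\lambda_{n,0} = \bar\lambda_{n,0}(t)$.

This means that the rest of the proof is similar to the proof of Theorem
A.4, using that on $\Omega$, $I_l(\theta|\theta_n^*)/c_1 \le \hat I_l(\theta|\theta_n^*) \le c_2I_l(\theta|\theta_n^*)$, for $l = 1,2$. □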

Proof of Theorem 2.2. Again, in Theorem 2.2 we have stated a
special case with $b = 1$, $c_1 = 3/2$, $\delta = 1/2$, $d = 2$, $\delta_1 = \delta_2 = 1/2$. We moreover
presented some simpler conservative estimates for the expressions that one
gets from inserting these values in Theorem A.5. □

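When $K_m$ is not known, the values $s$ and $t$ in Theorem A.5 are also not
known. We may however estimate them, as well as $a_n$ (and $\bar a_n = 4a_n$) to
obtain estimates of the levels $\exp[-na_n^2s^2]$ and $\exp[-n\bar a_n^2t^2]$. Define, for $K > 0$,
\[
a_n(K) := \sqrt{\frac{2\log(2m)}{n}} + \frac{\log(2m)}{n}K, \qquad \bar a_n(K) := 4a_n(K).
\]
Let
\[
\hat K_m := \max_{1\le k\le m} \frac{\|\psi_k\|_\infty}{\hat\sigma_k}.
\]
Note that on $\Omega$,
\[
K_m/c_2 \le \hat K_m \le c_1K_m.
\]
Condition III(c_1,c_2) is somewhat stronger than Condition III(c_0).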




Condition III(c_1,c_2). For some known constant $c_1 > 1$ and for $c_2 =
\sqrt{2c_1^2 - 1}/c_1$, it holds that
\[
\sqrt{\frac{\log(2m)}{n}}\,K_m \le \frac{\sqrt{6c_1^2 - 4} - \sqrt{2}\,c_1}{2c_1^2c_2}.
\]

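Lemma A.11. Assume Condition III(c_1,c_2). On $\Omega$, there exists a solution
$\underline{s} \le s$ of the equation
\[
1 - \frac{1}{c_1^2} = c_2\hat K_ma_n(c_2\hat K_m)\left( 1 + \underline{s}\sqrt{2\bigl(1 + 2c_2\hat K_ma_n(c_2\hat K_m)\bigr)} + \frac{2c_2\hat K_ma_n(c_2\hat K_m)\underline{s}^2}{3} \right). \tag{20}
\]
Take
\[
\bar\lambda_{n,0} \ge 4\sqrt{\frac{\log(2m)}{n}}\left( \sqrt{2} + \frac{\sqrt{6c_1^2 - 4} - \sqrt{2}\,c_1}{2c_1^2c_2} \right).
\]
Then there is a solution $\underline{t} \le t$ of the equation
\[
\bar\lambda_{n,0} = \bar a_n(c_2\hat K_m)\left( 1 + \underline{t}\sqrt{2\bigl(1 + 2c_2\hat K_m\bar a_n(c_2\hat K_m)\bigr)} + \frac{2c_2\hat K_m\bar a_n(c_2\hat K_m)\underline{t}^2}{3} \right).
\]

Proof. We have
\[
\sqrt{\frac{\log(2m)}{n}}\,\hat K_m \le c_1\sqrt{\frac{\log(2m)}{n}}\,K_m \le \frac{\sqrt{6c_1^2 - 4} - \sqrt{2}\,c_1}{2c_1c_2}.
\]
Hence
\[
2c_2\hat K_m\sqrt{\frac{\log(2m)}{n}} \le \frac{\sqrt{6c_1^2 - 4} - \sqrt{2}\,c_1}{c_1}.
\]
So
\[
c_2\hat K_ma_n(c_2\hat K_m) \le 1 - \frac{1}{c_1^2}.
\]
Hence there exists a solution $0 < \underline{s} \le s$ of (20). The solution $\underline{t}$ exists for
similar reasons. □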

Corollary A.3. Suppose Condition III(c_1,c_2). Let the level $\alpha$ be defined
in Theorem A.5 and $\underline{s}$ and $\underline{t}$ be given in Lemma A.11. Define the
estimated level
\[
\hat\alpha^U = \exp[-na_n^2(\hat K_m/c_1)\underline{s}^2]
+ \left( \log_{1+b}\frac{(1+b)^2\Delta(b,\delta,\delta_1,\delta_2)}{\delta_1\delta_2} \right)\exp[-n\bar a_n^2(\hat K_m/c_1)\underline{t}^2].
\]
Then with probability at least $1 - \exp[-na_n^2s^2]$, it holds that $\hat\alpha^U \ge \alpha$.

Remark. Using similar arguments, one obtains an estimate $\hat\alpha^L$ such
that $\hat\alpha^L \le \alpha$ with probability at least $1 - \exp[-na_n^2s^2]$.



A.5. Proof of the result of Example 4. 

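Lemma A.12. Let $Z(M)$ be the random variable defined in (14). Assume
that $\|f_{\theta^*} - \bar f\|_\infty \le \eta$ and that $MK_m + 2\eta \le c_3$. Then we have
\[
P(\{Z(M) \ge \tilde\lambda_{n,0}M\} \cap \Omega) \le 2\exp[-n\bar a_n^2t^2],
\]
where
\[
\tilde\lambda_{n,0} := \tilde\lambda_{n,0}(t) := c_2\bar a_n\sqrt{\frac{2\log(2m)}{n\bar a_n^2} + 2t^2} + c_3\bar\lambda_{n,0}(t).
\]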



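Proof. Clearly,
\[
\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\bigl(f_{\theta^*} - f_\theta\bigr)(X_i) \right|
\le I(\theta^* - \theta)\max_{1\le k\le m} \frac{\bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i) \bigr|}{\sigma_k}.
\]
Because the errors are $\mathcal{N}(0,1)$, we get for
\[
a = c_2\bar a_n\sqrt{\frac{2\log(2m)}{n\bar a_n^2} + 2t^2}
\]
that
\[
P\left( \left\{ \max_{1\le k\le m} \frac{\bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\psi_k(X_i) \bigr|}{\sigma_k} \ge a \right\} \cap \Omega \right)
\le 2m\exp\left[ -\frac{na^2}{2c_2^2} \right] = \exp[-n\bar a_n^2t^2].
\]
Moreover, the function $x \mapsto x^2/(2c_3)$ is Lipschitz when $|x| \le c_3$. Since
$\|f_\theta - \bar f\|_\infty \le 2\eta + MK_m \le c_3$, we can apply Corollary A.1 to find that with
probability at least $1 - \exp[-n\bar a_n^2t^2]$, we have
\[
\sup_{f_\theta\in\mathcal{F}_M} \frac{1}{2}\bigl| (Q_n - Q)\bigl((f_\theta - \bar f)^2 - (f_{\theta^*} - \bar f)^2\bigr) \bigr| \le c_3M\bar\lambda_{n,0}(t). □
\]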
fe^M 

Proof of Theorem 3.1. By Lemma A.9,
\[
P(\Omega) \ge 1 - \exp[-na_n^2s^2].
\]
Using Lemma A.12 and applying the same arguments as in the proof of
Theorem A.4, the result follows for general constants $b$, $\delta$, $d$, $\delta_1$ and $\delta_2$ and
constants $c_1$, $c_2 = \sqrt{2c_1^2 - 1}/c_1$ and $c_3$. Theorem 3.1 takes $b = 1$, $\delta = 1/2$,
$d = 2$, $\delta_1 = \delta_2 = 1/2$ and $c_1 = 3/2$, $c_2 := \sqrt{14/9}$ and $c_3 = 1$. □